Critical Thinking: Understanding and Evaluating Dental Research, Third Edition. ISBN 9780867158007, 086715800X


English · 419 pages · 2020


Table of contents:

Frontmatter
Table of contents
Dedication
Preface
Chapter 1: Reasons for Studying Critical Thinking
Critical Thinking
The Road to Publication
References
Chapter 2: Scientific Method and the Behavior of Scientists
The Behavior of Scientists
The Storybook Version of Scientific Method
Reproducibility and Openness
Positivism Versus Conventionalism
References
Chapter 3: The Components of a Scientific Paper
Active Reading and the Scientific Paper
Components and Their Functions
References
Chapter 4: Rhetoric
Classical Rhetoric
Audience
Abelsonʼs “MAGIC” Criteria for Persuasive Force
The Persuasion Palette
Summary
References
Chapter 5: Searching the Dental Literature
When to Seek Help
Information Sources
Open Access, Open Science, and Open Data
Citation Analysis and Research Impact
Conducting a Search
Final Search Tips
Acknowledgment
References
Chapter 6: Logic: The Basics
Some Basic Standards and Ground Rules
The Assertability Question
Types of Logic
An Introduction to Deductive Logic
The Value of a Formal Analysis of Arguments
References
Chapter 7: Introduction to Abductive and Inductive Logic: Analogy, Models, and Authority
Abduction and Scientific Discovery
Induction
Implications
Summary of the Logic of Criticism of Inductive Arguments
References
Chapter 8: Causation
Hypotheses of Cause and Effect
Practical Assessment of Causation
A Threat to Establishing Causation via Association: Confounding Variables
Application of Criteria for Causality in Different Fields
Practical Suggestions for Assessing Causation in Scientific Papers
Ends and Means
References
Chapter 9: Quacks, Cranks, and Abuses of Logic
Three Approaches to Medical Treatment
Scientific Cranks
Quacks
Abuses of Logic
References
Chapter 10: Elements of Probability and Statistics, Part 1: Discrete Variables
Probability and Distributions
Statistical Inference
Goodness of Fit
Contingency Tables: An Especially Useful Tool in Evaluating Clinical Reports
Resources for Statistics
References
Chapter 11: Elements of Probability and Statistics, Part 2: Continuous Variables
The Normal Distribution
Confidence Intervals
Relationships Between the Normal, Poisson, and Binomial Distributions
Concluding Remarks
References
Chapter 12: The Importance of Measurement in Dentistry
Operational Definitions
Types of Scales
Units
References
Chapter 13: Errors of Measurement
Precision Versus Accuracy
Types of Error
References
Chapter 14: Presentation of Results
Ideals and Objectives
The Selection and Manipulation of Data
Significant Digits: Conventions in Reporting Statistical Data
Tables
Illustrations
References
Chapter 15: Diagnostic Tools and Testing
Clinical Practice
Principle of Diagnosis
The Process of Coming to a Diagnosis
Diagnosing, Screening, Surveillance
The Diagnostic Tool Versus the Gold Standard Diagnostic Tool
Reliability of Measurements
Diagnostic Tool Properties: Sensitivity and Specificity
Positive Predictive Value, Negative Predictive Value, and Diagnostic Error Rates
Calculating Positive Predictive Value and Negative Predictive Value
Receiver Operating Characteristic Analysis
SpIN and SnOUT
Returning to Clinical Decision Making
Conclusion
Acknowledgments
References
Chapter 16: Research Strategies and Threats to Validity
Constraints, Purposes, and Objectives
The Concept of Validity
Categories and Prevalence of Problems
References
Chapter 17: Observation
Observation-Description Strategy
The Value of Qualitative Methods in Dental Research
Observation-Description Strategy in Clinical Interventions
References
Chapter 18: Correlation and Association
Cross-sectional Survey
Ecologic Study
Case-Control Design
Follow-up (Cohort) Design
Scientific Standards in Correlational Experiments Involving Humans
Some Potentially Serious Problems in Determining Causation from Observational Data
References
Chapter 19: Experimentation
Independent and Dependent Variables
Uncontrolled
Requirements for a Good Experiment
Types of Research
Tactics of Experimentation
Tactical Considerations in Clinical Experiments
Typical Variables to Control or Consider in Biologic Research
References
Chapter 20: Experiment Design
Managing Error
Some Common Experiment Designs
References
Chapter 21: Statistics As an Inductive Argument and Other Statistical Concepts
Statistics ≠ Scientific Method
Statistical Inference Considered As an Inductive Argument
The Fallacy of Biased Statistics
Other Statistical Concepts
A Final Warning and a Set of Rules
Now, Something Completely Different: The Bayesian Approach to Induction
References
Chapter 22: Judgment
Clinical Versus Scientific Judgment
Critical Thinking As Applied to Scientific Publications
Putting It All Together: Argumentation Maps
Balanced Judgments
Judgments Under Uncertainty, Heuristics, and Cognitive Biases
Forming Scientific Judgments: The Problem of Contradictory Evidence
Using Othersʼ Judgments: Citation Analysis As a Means of Literature Evaluation
Influencing Judgment: Lower Forms of Rhetorical Life
References
Chapter 23: Introduction to Clinical Decision Making
Critical Thinking and Decision Making
Decision Making in the Context of Patient Care
Supplementation of Decision Tree Analysis with Other Approaches to Decision Making
References
Chapter 24: Exercises in Critical Thinking
Problems
Comments on Problems
Appendices
Appendix 1
Appendix 2
Appendix 3
Appendix 4
Appendix 5
Appendix 6
Appendix 7
Appendix 8
Appendix 9
Index


Critical Thinking: Understanding and Evaluating Dental Research, Third Edition

Donald Maxwell Brunette, PhD
Professor Emeritus, Department of Oral Biological and Medical Sciences, Faculty of Dentistry, University of British Columbia, Vancouver, Canada

With contributions by

Ben Balevi, DDS, MSc
Adjunct Professor, Faculty of Medicine, University of British Columbia
Private Practice, Vancouver, Canada

Helen L. Brown, MLIS, MAS, MA
Reference Librarian, University of British Columbia Library, Vancouver, Canada

Berlin, Barcelona, Chicago, Istanbul, London, Mexico City, Milan, Moscow, Paris, Prague, São Paulo, Seoul, Tokyo, Warsaw


Library of Congress Cataloging-in-Publication Data
Names: Brunette, Donald Maxwell, author.
Title: Critical thinking : understanding and evaluating dental research / Donald Maxwell Brunette.
Description: Third edition. | Batavia, IL : Quintessence Publishing Co, Inc, [2019] | Includes bibliographical references and index.
Identifiers: LCCN 2019017538 (print) | LCCN 2019018451 (ebook) | ISBN 9780867158014 (ebook) | ISBN 9780867158007 (softcover)
Subjects: MESH: Dental Research
Classification: LCC RK80 (ebook) | LCC RK80 (print) | NLM WU 20.5 | DDC 617.6/027--dc23
LC record available at https://lccn.loc.gov/2019017538

© 2020 Quintessence Publishing Co, Inc

Quintessence Publishing Co Inc
411 N Raddant Road
Batavia, IL 60510
www.quintpub.com

5 4 3 2 1

All rights reserved. This book or any part thereof may not be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Editor: Zachary Kocanda
Design: Sue Zubek
Production: Sarah Minor

Printed in Korea



Dedication

To the late Doug Waterfield, with whom I collaborated extensively once it became clear through the in vivo studies of Babak Chehroudi that macrophages played an important role in tissue response to the topography of implants. Doug was a tireless advocate for students in the Faculty of Dentistry and spared no effort to help them succeed.


Preface

The need for a third edition of Critical Thinking was driven by developments in dental and biomedical research conditions and practice, as well as reader comments. The third edition emphasizes the practical application of critical thinking in the production and evaluation of scientific publications. A number of issues are now more prominent. Thanks to the Internet, patients of today have much more information (both accurate and inaccurate) than the patients of yesteryear. The consequence to dentists is that they must explain and be prepared to justify their clinical decisions with evidence. This requires that dentists be capable of unearthing relevant information and evaluating alternatives by accepted criteria so that patients can be confident in their dentist’s recommendations.

In line with the virtual revolution entailed in dealing with the ever-expanding scientific literature, the chapter on searching the dental literature now emphasizes scholarly publishing issues at all stages of research in addition to covering more traditional topics such as types of resources and searching techniques. The preparation of systematic reviews is discussed in greater detail, as are the strengths and weaknesses of various services such as Web of Science and Google Scholar as well as approaches to identifying relevant evidence in grey literature and its use in reviews. The author of the revised chapter is Helen Brown, who has taken over from Kathy Hornby, for whose contribution I continue to be grateful. Helen has master’s degrees in English literature and library science, along with frontline experience resolving student problems in literature searches. She has made significant changes in the emphasis of this chapter. For example, the increasing trend toward open access publications and open science initiatives is addressed, providing important information for both prospective authors and readers with limited access to expensive subscription-based resources. The strengths and weaknesses of the h-index, now widely used
by researchers to demonstrate the impact of their work, are also now considered, and there is updated information on impact factors of dental journals and other bibliographic tools.

A second new contributor is Ben Balevi, who takes over the chapter on diagnostic tests and measures from my former student Carol Oakley, whose contributions continue to influence the current chapter. Ben has a master’s degree in evidence-based dentistry and has been recognized for his contributions in this area by the American Dental Association.

There have been considerable changes in other aspects of the book as well. A new chapter discusses the function of each section of the scientific paper so that researchers can check that their papers are fulfilling the expectations of readers and referees. The coverage on logic has expanded to include more material on informal logic, for example, the use of Walton’s critical questions for evaluating causation. The informal logic used in the refereeing process, in which the burden of proof shifts from authors to reviewers and back, has been added to complement the argumentation maps found in the chapter on judgment, which features tactical details such as rebuttals, caveats, and supporting evidence. The intent is to give investigators a practical approach to anticipating and dealing with criticisms. The chapter on judgment now includes material on both historical decision making (Darwin’s decision to marry and Benjamin Franklin’s moral algebra) and novel decision-making heuristics, such as fast-and-frugal heuristics. Clinical decision making has been given its own chapter and expanded to include detailed, worked-out calculations and clinical scenarios. In addition to decision tree analysis, two elements of central importance to patients are presented with worked-out calculations: cost-effectiveness analysis and willingness-to-pay analysis. The chapter on diagnostic tools and testing has been expanded to include new worked-out calculations and examples, as well as illustrations
to clarify this topic, which many find difficult and confusing. The chapter on experimentation now covers randomized field trials, a method employed in causally dense environments such as business problems, often combined with trial-and-error approaches. The technique is evolving but has been found useful by organizations such as Google that run thousands of randomized field trials. The statistics chapter has expanded coverage of Bayesian statistics, as this approach is expected to become more widely used. The chapter on presenting results contains new illustrations and more extensive discussion of the integrity-based principles for displaying information emphasized by Tufte. The section on the behavior of scientists has been expanded with discussion on the reproducibility crisis now shaking some scientists’ faith in the reliability of reported results, even including those published in high-impact journals. All chapters have also been improved by the addition of updates, revisions, or examples. I acknowledge the support of my colleagues who have never hesitated to provide advice in their respective domains of expertise. I also thank my former students, friends, and collaborators, whose insights
inform numerous sections. Regular University of British Columbia (UBC) contributors in this way include Babak Chehroudi, Tim Gould, Hugh Kim, Chris Overall, Mandana Nematollahi, Eli Konorti, and Nick Jaeger (Electrical and Computer Engineering). Colleagues outside UBC, Christopher McCulloch (University of Toronto), Ken Yaegaki (Nippon Dental University), and Doug Hamilton (Western University), have also provided advice and encouragement. I also thank my collaborators at the Swiss Institute of Technology (Zurich), Professors Marcus Textor and Nicolas Spencer, and my former post-doc Dr Marco Wieland, for continuing my education in surface science. Of course, I would be an ingrate if I failed to acknowledge the continuing support of my wife Liz, sons Max and Regan (and their partners Heather and Malizza), who provided various examples used in the book including the exploits of my problem-solving grandson Calixte and developing gymnast granddaughter Genèvieve. I also thank Quintessence for their ongoing support through three editions of this work. This edition has been facilitated through the efforts of Bill Hartman and Bryn Grisham. Zach Kocanda has been exemplary in his attention to detail, often under trying circumstances, in editing the manuscript.


1 Reasons for Studying Critical Thinking

“It has happened more than once that I found it necessary to say of one or another eminent colleague, ‘He is a very busy man and half of what he publishes is true but I donʼt know which half.’” ERWIN CHARGAFF1

Critical Thinking

Critical thinking has been defined many ways, from the simple—“Critical thinking is deciding rationally what to or what not to believe”2—to the more detailed “Critical thinking is concerned with reason, intellectual honesty, and open-mindedness, as opposed to emotionalism, intellectual laziness, and closed-mindedness”3—to the nearly comprehensive:

Critical thinking involves following evidence where it leads; considering all possibilities; relying on reason rather than emotion; being precise; considering a variety of possible viewpoints and explanations; weighing the effects of motives and biases; being concerned more with finding the truth than with being right; not rejecting unpopular views out of hand; being aware of oneʼs own prejudices and biases; and not allowing them to sway oneʼs judgment.3

Self-described practitioners of critical thinking range from doctrinaire postmodernists who view the logic of science with its “grand narratives” as inherently subordinating4 to market-driven dentists contemplating the purchase of a digital impression scanner. In this book, critical thinking, and in particular the evaluation of scientific information, is conceived as “organized common sense” following Bronowskiʼs view of science in general.5 Of course, common sense can be quite uncommon. A secondary use of the term critical thinking implies that common sense involves a set of unexamined and erroneous assumptions. For example, prior to Galileo, everyone “knew” that heavy objects fell faster than lighter ones. Critical thinking as organized common sense takes the systematic approach of examining assumptions.

The professional use of critical thinking is particularly complex for dental professionals because they live in two different worlds. On the one hand, they are health professionals treating patients who suffer from oral diseases.


On the other hand, dentists typically also inhabit the business world, where decisions may be based on the principle of maximizing income from their investment. Dental practice is based only very loosely on responding to disease6; less than one-third of patient visits result in identifying a need for restorative care.7 Twenty percent of work is elective, such as most of orthodontics, tooth whitening, and veneers, and typically that work comprises the most lucrative aspects of practice. Thus, the information that must be evaluated in performing these disparate roles covers the spectrum from advertisements to financial reports to systematic meta-analysis of health research. Dentists are health professionals, people with specialized training in the delivery of scientifically sound health services. The undergraduate dental curriculum is designed to give dental students the basic knowledge to practice dentistry scientifically, at least to the extent allowed by the current state of knowledge. But if any guarantee can be made to dental students, it is that dentistry will change, because the knowledge base of biomedical and biomaterial sciences grows continually. Most dentists today have had to learn techniques and principles that were not yet known when they were in dental school. In the future, as the pace of technologic innovation continues to increase and the pattern of dental diseases shifts, the need to keep up-to-date will be even more pressing. Means of staying current include interacting with colleagues, reading the dental literature, and attending continuing education courses—activities that require dentists to evaluate information. Yet, there is abundant historical evidence that dentists have not properly evaluated information. Perhaps the best documented example in dentistry of a widely accepted yet erroneous hypothesis is the focal infection theory. Proposed in 1904 and accepted by some clinicians until the Second World War, this untested theory resulted in the extraction of millions of sound teeth.8 But errors are not restricted to the past; controversial topics exist in dentistry today because new products or techniques are continually introduced and their usefulness debated. Ideally, dentists should become sophisticated consumers of research who can distinguish between good and bad research and know when to suspend judgment. This goal is different from proposing that dentists become research workers. One objective of this book is to provide the skills enabling a systematic method for the evaluation of scientific papers and presentations. A marked addition to the challenges of dental practice in recent years is that patients have increased access through the Internet to information as well as
misinformation. Dentists thus are more likely to be questioned by patients on proposed treatment plans and options. In responding to such questions, it is clearly advantageous for dentists to be able to present a rational basis for their choices. Chapter 23 covers an evidence-based approach to clinical decision making, and appendix 9 provides a template for dental offices to use in documenting their decisions based on recent evidence.

A systematic approach to analyzing scientific papers has to be studied, because this activity requires more rigor than the reasoning used in everyday life. Faced with an overabundance of information and limited time, most of us adopt what is called a makes-sense epistemology. The truth test of this epistemology or theory of knowledge is whether propositions make superficial sense.9 This approach minimizes the cognitive load and often works well for day-to-day short-term decision making. In 1949, Zipf of Harvard University published Human Behavior and the Principle of Least Effort, in which he stated:

The Principle of Least Effort means, for example, that in solving his immediate problems he will view these against a background of his probable future problems, as estimated by himself. Moreover, he will strive to solve his problems in such a way as to minimize the total work that he must expend in solving both his immediate problems and his probable future problems.10

Zipf used data from diverse sources ranging from word frequencies to sensory sampling to support his thesis. Although the methods and style of psychologic research have changed, some more recent discoveries, such as the concept of cognitive miser in studies of persuasion,11 coincide with Zipfʼs principle. Kahneman in Thinking, Fast and Slow has elevated the principle to a law, noting that we “conduct our mental lives by the law of least effort.”12

In science, the objective is not to make easy short-term decisions but rather to explain the phenomena of the physical world. The goal is accuracy, not necessarily speed, and different, more sophisticated, more rigorous approaches are required. Perkins et al9 have characterized the ideal skilled reasoner as a critical epistemologist who can challenge and elaborate hypothetical models. Where the makes-sense epistemologist or naive reasoner asks only that a given explanation or model makes intuitive sense, the critical epistemologist moves beyond that stage and asks why a model may be inadequate. That is, when evaluating and explaining, the critical epistemologist asks both why and why not a postulated model may work. The critical epistemologist arrives at models of reality, using practical tactics and skills and drawing upon a large repertoire of logical and heuristic methods.9

Psychologic studies have indicated that everyday cognition comprises two sets of mental processes, System 1 and System 2, which work in concert, but there is some debate whether they operate in a parallel or sequential manner. System 1 operates quickly and effortlessly, whereas System 2 is deliberate and requires attention and effort.12 System 2 is a rule-based system, and engaging System 2 is the surest route to fallacy-free reasoning.13 System 2 becomes engaged when it catches an error made by the intuitive System 1. The good news is that extensive work by Nisbett and colleagues (briefly reviewed by Risen and Gilovich13) showed that people can be trained to be better reasoners and that people with statistical backgrounds were less likely to commit logical fallacies. Nisbett and colleagues further demonstrated that even very brief training was effective in substantially reducing logical and statistical errors. Thus this book has chapters on logic and statistics. A second objective of the book is to inculcate the habits of thought of the critical epistemologist in readers concerned with dental science and clinical dentistry.

Table 1-1 | Level of evidence guideline recommendations of the United States Agency for Healthcare Research and Quality

Level  Type of study                                                                       Grade of recommendation
1      Supportive evidence from well-conducted RCTs that include 100 patients or more      A
2      Supportive evidence from well-conducted RCTs that include fewer than 100 patients   A
3      Supportive evidence from well-conducted cohort studies                              A
4      Supportive evidence from well-conducted case-control studies                        B
5      Supportive evidence from poorly controlled or uncontrolled studies                  B
6      Conflicting evidence with the weight of evidence supporting the recommendation      B
7      Expert opinion                                                                      C

RCTs, randomized controlled trials.

The scope of the problem

In brief, the problems facing anyone wishing to keep up with developments in dentistry or other health professions are that (1) there is a huge amount of literature, (2) it is growing fast, (3) much of it is useless in terms of influencing future research (less than 25% of all papers will be cited 10 times in all eternity,14 and a large number are never cited at all), and (4) a good deal of the research on a clinical problem may be irrelevant to a particular patientʼs complaint.

The actual rate of growth of the scientific literature has been estimated to be 7% per year of the extant literature, which in 1976 comprised close to 7.5 million items.11 This rate of growth means that the biomedical literature doubles every 10 years (see the worked check at the end of this section). In dentistry, there are about 500 journals available today.15 Many dental articles are found in low-impact journals, but, ignoring these, there were still 2,401 articles published in 1980 in the 30 core journals.16 More recently, it has been estimated that about 43,000 dental-related articles are published per year.

However, the problem is not intractable. Relman,17 a former editor of the New England Journal of Medicine, believes that most of the important business of scientific communication in medicine is conducted in a very small sector of top-quality journals. The average practitioner needs to read only a few well-chosen periodicals.17 The key to dealing with the problem of the information explosion is in choosing what to read and learning to evaluate the information.

Dentists are exposed to diverse information sources, and the important issues vary depending on the source. For example, a dentist may wish to determine whether potassium nitrate toothpastes reduce dentin hypersensitivity. One approach would be to look up a systematic review on this topic in the Cochrane Library,18 which many regard as the highest level in the hierarchy of evidence (Table 1-1). The skills required to understand the review would include a basic knowledge of statistics and research design. The same dentist, facing the competitive pressures of his or her local market, might also want to determine whether a particular laser-bleaching process should be adopted for the practice. In that instance, there might not be a relevant Cochrane review, and there may not even be a relevant paper in a refereed journal to support a decision. Available evidence might consist of advertising brochures and anecdotes of colleagues. The dentist may have to employ a different set of skills, ranging from evaluating the lie factor in graphics (see chapter 14) to disentangling rhetoric from fact. Advertisements and salesmanship are persuasive exercises; the chapter on rhetoric (chapter 4) deals with means of persuasion.

Typically, dentists acquire information on innovative procedures through participation in networks in which their colleagues supply informal data on the effectiveness of the innovations. Nevertheless, dentists cite reading peer-reviewed dental literature and experimental studies as the gold standard for determining the quality of innovations.19 New technology is often introduced into their practices through trial and error; dentists take the pragmatic approach of directly determining what works in their hands in their practice.19 Doubtless, some of the personal and financial expenses typical of the trial-and-error approach could be reduced with more effective evaluation of information prior to selecting a material or technique for testing. This book focuses on evaluating refereed scientific papers, but many of the issues of informational quality and questions that should be asked apply equally to other less formal channels of communication.
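As promised above, here is the worked check of the doubling-time arithmetic: a back-of-envelope sketch assuming steady exponential growth at the 7% per year estimated above (an assumption, since real growth rates fluctuate). If the stock of literature is N(t) after t years,

\[
N(t) = N_0 (1.07)^t, \qquad N(t^*) = 2N_0 \;\Rightarrow\; t^* = \frac{\ln 2}{\ln 1.07} \approx \frac{0.693}{0.0677} \approx 10.2 \text{ years,}
\]

which agrees with the statement that the biomedical literature doubles roughly every 10 years.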

What is a scientific paper?

The Council of Biology Editors defines a scientific paper as follows:

An acceptable primary scientific publication must be the first disclosure containing sufficient information to enable peers (1) to assess observations; (2) to repeat experiments; and (3) to evaluate intellectual processes; moreover, it must be sensible to sensory perception, essentially permanent, available to the scientific community without restriction, and available for regular screening by one or more of the major recognized secondary services.20

Similar ideas were stated more succinctly by DeBakey,21 who noted that the contents of an article should be new, true, important, and comprehensible.

A good deal of the literature circulated to dentists does not meet these requirements. But even excluding the throwaway or controlled-circulation magazines that are little more than vehicles for advertisements, the amount of information published annually appears formidable. One approach to dealing with a large number of papers is to disregard original papers and receive information secondhand. Dental and medical journals present reviews of current research in specific clinical or scientific fields; some journals, such as Dental Clinics of North America and the Journal of Evidence Based Dentistry, are exclusively devoted to this approach. Although this tactic reduces the volume of literature to be covered, it does not solve the problem of evaluating the information contained in the reviews. To perform this task effectively, a researcher must be able to assess the soundness of the reviewerʼs conclusions. In deciding to accept information secondhand, the researcher is also deciding whether the author of the review is a reliable, objective authority. Thus, the problem of evaluation has been changed, but not eliminated.

This book focuses on the primary literature, where it is hoped that new, true, important, and comprehensible information is published. The systematic review, a relatively new review form, attempts to deal with some of the more glaring problems of traditional reviews and is covered briefly in chapter 5. Although useful for some purposes, the systematic review has its own shortcomings, and the researcher must judge how these affect the conclusions. Journals vary in quality; chapter 5 also discusses bibliometric approaches to ranking journals. In the following section, I present a brief review of how articles get published that may help explain some of this variation.

The Road to Publication

The author

The authorʼs goal is to make a significant contribution to the scientific literature: a published paper. To accomplish that goal, the author will have to produce a submission for publication whose contents are new, true, important, and comprehensible. Moreover, the author wants to publish the paper in a journal whose readers will likely find the paper of interest and hopefully be influenced by it. As journals vary in the rigor they demand and the length of papers they accept, the author needs to identify the best journal for his or her purposes.


Refereed versus nonrefereed journals

The first hurdle faced by an article submitted for publication is an editorʼs decision on the articleʼs suitability for the journal. Different journals have different audiences, and the editors are the arbiters of topic selection for their journal. Editors can reject papers immediately if they think the material is unsuited to their particular journal. In some journals, acceptance or rejection hinges solely on the opinion of the editor. However, this method is problematic because informed decisions on some papers can only be made by experts in a particular field. Therefore, as a general rule, the most highly regarded journals ask the opinion of such specialists, called referees or editorial consultants. Referees attempt to ensure that a submitted paper does not demonstrably deviate from scientific method and the standards of the journal. Whether a journal is refereed can be determined by consulting Ulrichʼs Periodicals Directory (ulrichsweb.com). Editors usually provide referees with an outline of the type of information that they desire from the referee. The criteria for acceptance will necessarily include both objective (eg, obvious errors of fact or logic) and subjective (eg, priority ratings) components.

Unfortunately, the task of refereeing is difficult and typically unpaid. Refereeing is often squeezed in among other academic activities, so it should not be surprising that it sometimes is not done well and that referees often disagree. Studies of the reliability of peer-review ratings are disappointing for readers wanting to keep faith in the peer-review system. Reliability quotients, which can range from 0 (no reliability) to 1 (perfect reliability), for various attributes of papers submitted to a psychology journal22 follow:

• Probable interest in the problem: 0.07
• Importance of present contribution: 0.28
• Attention to relevant literature: 0.37
• Design and analysis: 0.19
• Style and organization: 0.25
• Succinctness: 0.31
• Recommendation to accept or reject: 0.26

Despite such problems, there is evidence that the review process frequently raises important issues that, when resolved, improve the manuscript substantially.23 After consulting with referees, the editor decides whether the paper should be (1) published as is—a comparatively rare event; (2) published after suitable revision; or (3) rejected. Journals reject papers in proportions varying from 0% to greater than 90%.
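Returning to the reliability quotients listed above: as a minimal sketch of what such a number means, one common way to quantify agreement is to correlate two refereesʼ scores for the same set of submissions. This illustration is mine, not the cited studyʼs actual method, and the ratings below are hypothetical.

# Estimating a reliability quotient as the Pearson correlation between
# two referees' 1-5 ratings of the same seven submissions (illustrative data).
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

referee1 = [4, 2, 5, 3, 1, 4, 2]
referee2 = [3, 4, 4, 1, 2, 2, 5]
print(round(pearson(referee1, referee2), 2))  # prints 0.0: no agreement at all

A quotient near 0, as here, means one refereeʼs rating tells you essentially nothing about what the other referee will say, which is why values such as 0.07 for “probable interest in the problem” are so disappointing.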

The literature available to dental health professionals spans the spectrum from refereed to nonrefereed, from low (or no) rejection rates to high rejection rates. The Journal of Dental Research, for example, used to have a 50% rejection rate (Dawes, personal communication, 1990), but that has risen so that 25 years later some 90% of submissions are rejected.24 Even among high-impact journals, however, there is no guarantee that the referees did a good job. In fact, these considerations only serve to reinforce the view “caveat lector”—let the reader beware.

Editors and referees

The editor of the journal and the referees are the “gatekeepers” who decide whether a manuscript is accepted. In science the basic rule appears to be something akin to “if it doesnʼt get published, it doesnʼt exist.” Thus the rewards in science go to those who publish first, not the first scientist to observe a phenomenon. Obviously, pleasing these gatekeepers is essential to a scientific career. The editor and the referees are the representatives of the readers of the journal. They protect the readers from wasting their time on obviously erroneous or uninteresting or unsuitable or unoriginal or opaque or trivial submissions. The papers in the journal must be relevant to the readership. An important part of the editorʼs job is to protect authors from unjust criticism that can arise from such things as personal animosity between an author and a referee or an attempt by a referee to block publication of a competitorʼs work. Unfortunately, the scientists who are best able to evaluate a submission may be individuals who can suffer most from its publication, as for example occurs when the refereeʼs own work is “scooped” (ie, published earlier by a competitor).

To justify readersʼ expenditure of time, the paper should address a significant problem or concern and provide a significant amount of information. The length of journal articles varies; some journals publish “letters” rather than full-length papers for interesting but only briefly developed findings. Editors are interested in publishing papers that are likely to be cited in the future or, expressed another way, are building blocks for future research or clinical application. Tacker25 notes that journals differ in the sophistication of their readership. A general medical journal (eg, JAMA) is written at the comprehension level of a third-year medical student, whereas a specialty journal is written for a first- or second-year resident. A scientific journal should be understandable to third- or fourth-year PhD candidates or above in the general field.

The editor

The editor decides ultimately whether to accept or reject a submission. As a general rule the editor is an unpaid (or lowly paid) volunteer of distinguished standing in the field covered by the journal. The editor defines the scope of the journal (ie, what subjects are published in it), and if a manuscript falls outside the journalʼs mandate, it will probably be returned promptly to the author. Similarly, an editor may reject a paper on the grounds that a submission does not advance the field sufficiently or has a low potential for future impact. Such judgments are subjective but nevertheless may need to be made. I call this the “de gustibus” standard after the Latin adage, De gustibus non est disputandum: “In matters of taste, there can be no disputes.” As the adage indicates, if a decision is made on this basis it will be difficult to persuade the editor to reverse it.

Editors are often responsible for diverse other tasks such as recruiting referees and persuading them to submit their reviews in a timely manner. Some journals have associate editors who oversee submissions in their area of expertise, and the editor must coordinate their activities as well as consult with editorial boards and deal with the various business matters. Despite the importance of their job, editors are not always appreciated by their colleagues, who may resent some decisions. Chernin playfully suggests, “Editors are also the people who separate the wheat from the chaff and frequently publish the chaff.”26 After the manuscript is accepted by the editor, it may be passed on to a managing editor to take the manuscript through the production and publication process.

Day27 states that editors and managing editors have jobs that are made impossible by the attitudes of authors who submit to their journals. For example, authors might ignore the rules and conventions specified by the journal (eg, the format for the references). Or authors and referees may have irreconcilable views, and the editor may be caught in the middle. Given that the editorʼs decision could affect the authorʼs career, it is clearly wise not to irritate editors or referees, but rather to make their job in dealing with the submission as easy as possible. That is, authors want the editor to like them, and as has been extensively studied in the psychology literature,28 liking can be a key factor in persuasion, in this case persuading the editor that the submission should be published.

An indicator of what editors want is provided by the instructions given to referees of journals, often in the form of a checklist or a score sheet that incorporates specific questions for reviews completed online. As an example, I compiled an indicator of some of editorsʼ concerns by simply looking at the instructions sent to me by ten journals. The following characteristics were emphasized:

• 90% (ie, 9/10) concise
• 70% clear
• 70% evaluate by section (eg, introduction, methods)
• 70% adequacy of references
• 60% originality
• 60% adequacy of illustrations
• 50% relationship of conclusions to results

Overall, the instructions emphasize economy of expression, ignoring the folk wisdom that “Sermons on brevity and chastity are about equally effective.”26 Nevertheless, it is useful for prospective authors to obtain a specific checklist for the journal to which they are submitting so that they can attempt to meet the journalʼs expectations.

The referees

The referees are unpaid volunteers; nevertheless they do receive some nonmonetary rewards. They get first access to new information in a field that interests them, and their decisions can influence the direction of that field. On occasion that information may be useful—for example, a submission could contain a reference of which the referee was unaware or a new technique that might be beneficial to the refereeʼs own research, or reading the article might prompt an idea for the refereeʼs future research. Finally, in doing a favor to the editor in refereeing a manuscript, the referee might acquire a store of goodwill that might help when his or her own manuscript is submitted to the journal (another well-accepted persuasive factor: reciprocation).28

Nevertheless, refereeing papers is a low-yield task—the refereesʼ efforts help the editor and those whose papers are published, but the referee typically gets no tangible benefit save the good feeling that comes from doing the right thing. Spending their own time on work from which others will benefit is bound to lead to resentment if those potentially benefitted make the task more difficult than it need be. The applicable golden rule then is to do unto the referees as you would have them do unto you. Make it easy for the referees in the hope they will make it easy for you. In this spirit, then, authors should attempt to meet the expectations of referees, in particular not wasting their time. In general, referees expect a scientific writing style characterized by the following qualities:

or reject, priority, overall length of the paper. More detailed points are given in the comments to authors or the editor or editor’s assistant. The confidential comments to the editor give the referee the possibility to offer frank criticism that might be construed by some authors as being insulting. For example, a referee might comment that the paper is poorly written and needs revision by a native English speaker, and such a comment might be insulting to an author who was in fact a native English speaker.

Referees versus authors Typically referees make critical comments on the papers they are reviewing ranging from the easily correctable, such as typographic errors or formatting, through more difficult problems to correct, such as lack of clarity in organization, deficiencies such as inappropriate methodology, or erroneous logic, that lead to unsupported conclusions. Typically the referees will number their comments, and the editor will require the author to address each of them. So in effect the authors and each of the referees enter into a debate presided over by the editor, who might also provide some comments, that might be classed according to the conventions of informal logic or pragmatics as a “persuasion dialogue.”33 The participants are obligated to give helpful and honest replies to their opponents’ questions. In theory each participant in the dialogue is supposed to use arguments exclusively composed of premises that are commitments of the other participant. But in argumentation, as in life, commitments are notoriously difficult to extract from an opponent, and pretty much the best one can hope for is plausible commitment to an opinion based on reasoned evidence. In conducting the argument the participants are also obligated with a burden of proof, which shifts from one to the other during the dialog. For example, in submitting the paper the author, as proponent, assumes the burden of proof for the conclusions of the paper, and the components of the paper (ie, methods, data, figures, tables, and logic), constitute the means of bearing that burden. Similarly the referee, in making a criticism, assumes the burden of proof of justifying the criticism. This may be done by various means such as citing deficiencies in the evidence in the paper, external scientific evidence (such as previously published papers), or expected standards in the field of study. The editor forwards the referees’ criticisms along with a preliminary decision to the authors who, if they want the submission to proceed to publication, are

7

Brunette-CH01.indd 7

10/9/19 10:27 AM

REASONS FOR STUDYING CRITICAL THINKING

expected to bear the burden of proof in responding to the criticisms. This dialogue can be carried over several cycles. Often in my experience, it seems that referees seldom accept or commit to the author’s arguments; rather they merely concede by terminating discussion. In science, as in life, it is difficult to say “Sorry, I was wrong.” In some instances agreement between the referees and the authors is never achieved, but the issues are clarified to an extent that the editor can make a decision. The question arises of the logic used by editors in making their decisions. First it should be noted that different types of reasoning employ different standards of proof, and this is not unusual in human affairs. In law for example the standard of proof in criminal cases is “beyond reasonable doubt” whereas civil cases are decided on the “balance of probabilities.” Scientific arguments can be complex and may entail various forms of logic ranging from the certainty of deductive logic employed in mathematics to inductive logic, which can deal with calculated probabilities, to informal logic that balances many factors but does not necessarily proceed by strict numeric calculation so that the conclusions are classed qualitatively in terms of their relative plausibility. Perhaps the reasoning process most employed by editors, who have to make a practical decision, would be the pragmatic model devised by the philosopher Stephen Toulmin34 (see also chapter 22 for more on argumentation maps), which specifies a system for scientific explanation that includes Claims (such as conclusions in the paper) justified by Evidence and Warrants. A Warrant is the means that connects the Claim to the Evidence; it may be, for example, a scientific principle or a connection established by previous work. An important aspect of Toulmin’s approach is that it is field dependent so that appropriate standards are employed for differing types of scientific endeavor. One can see this aspect in action in the scientific literature by observing the content of papers where the rigor of the methods, the quantity of data, or the articulation of the findings differ among different fields of science or among the journals within one field of science. It is the editor who determines the standards of his/ her journal, and differences between editors in what they consider important findings or flaws can result in a paper rejected by an editor of one journal being accepted by another one. There are other elements in the Toulmin model, including the Rebuttal arguments that restrict or counter the claim and the Qualifier, which indicates the degree of certainty that the proponent assigns to the Claim (eg unlikely, possibly, highly

probable, or beyond any reasonable doubt) and this feature can hold the key to resolution of conflicts. Authors can back off or limit their claims to account for the views of the referees, and the editor can in good conscience publish the article.

The readers The end users of the published paper, the readers, have been defined as anyone who reads the text “with an intentional search for meaning.”35 Editors and referees are knowledgeable about their fields and like the authors suffer from the problem of familiarity with the assumptions, conventions, and expectations of investigators in their field so that they tend to “fill in” what an author might leave out. General readers, however, differ from the editors and referees in that on average they are less familiar with the research field and may lack information required to understand the submission. Expressed another way, they can’t fill in what the author leaves out. As the readers vary widely in their expertise, it falls to the author to determine what they are likely to know (ie, what is common knowledge to everyone in the field), and conversely what the author needs to point out to them. Anything that is novel or unusual needs to be described in detail; for example, investigators may vary from the standard methods in their measurements or calculations, and such changes need to be highlighted and explained.

Editorial independence Ideally, the contents of the journal should be independent of economic issues, but this is not necessarily the case. Publication of color illustrations can be prohibitively expensive, and many respected journals are publications of learned societies that operate on lean budgets. The Journal of Dental Research, for example, is published and subsidized by the International Association for Dental Research. Such a journal would be expected not to be subject to advertisers’ influence. Other journals have a need to generate income, and, in some instances, entire issues appear to be sponsored by a commercial interest. It is not unreasonable to wonder whether the advertiser influenced the editorial content, for “he who pays the piper calls the tune.” In recent years, Internet-based journals have arisen that are financed by authors through charges per page. As hard copy of the articles are not produced or distributed, costs are minimal, and the potential for profit is

8

Brunette-CH01.indd 8

10/9/19 10:27 AM

The Road to Publication

great. There is thus an incentive for such journals to have a very low (or no) rejection rate, and questionable quality may result.

Three general questions A scientific paper is not necessarily an unbiased account of observations; it is more likely an attempt to convince the reader of the truth of a position. As noted by Ziman,36 it is important to realize that much of the research literature of science is intended, rhetorically, to persuade other scientists of the validity of received opinions. Thus, a reader can expect an author to present his or her data in the most favorable light. Tables, figures, and even calculations may be done so that differences between groups are accentuated and the appearance of error is minimized. A reader’s defense as a consumer of this information is an attitude of healthy skepticism. Three general questions a skeptical reader should ask are: Is it new? Is it true? Is it important?37

Is it new? A minimum requirement for publication in most instances is that the information is new. However, new can be defined in various ways. If a paper using standard histologic techniques reporting the development of teeth in lynx were to be published tomorrow, it might well be new, because, as far as I am aware, the development of lynx teeth has not been described previously. However, it probably would not be new in adding anything to our knowledge of tooth development in general. Such a paper would merely fill in the gaps, however small, in our knowledge. I think that journal editors are fairly lenient in their judgments on what constitutes new information. Kuhn38 states that one of the reasons why normal puzzle-solving science seems to progress so rapidly is that its practitioners concentrate on problems that do not tax their own lack of ingenuity. The quality that often distinguishes good scientific papers from the mediocre is originality. Funding agencies are probably better gatekeepers of science in this regard, because an essential criterion for funding is originality. Originality can appear in any component of the research process, including the questions being asked, the methods employed, the research design, or even the interpretation. Because science is a progressive business, approaches that were once original and sufficient can with time become derivative and

deficient. Returning to the example, because scientists have been studying tooth development for decades using standard histologic techniques, there is not much hope that reworking the same approach would provide anything exciting; new methods would be required to bring new insights. As a consequence of scientific progress, methods become outdated and standards change. Changing standards can be seen in biochemistry by examining the standards for publication of data using polyacrylamide gels. Early publications using the technique showed photographs of gels that did not have good resolution or uniformity and showed poor staining. The photographs of gels were often so uninformative that Archives of Oral Biology instructed authors to submit densitometric tracings of the gels. Currently, gel separations are done in two dimensions with superb resolution, and the proteins are stained with much greater sensitivity. A photograph of a gel that would have been acceptable 30 years ago would not be acceptable for publication today. In judging papers, therefore, a key question is whether the techniques and approach are up-to-date as well as whether the question is original. This principle is so well accepted that investigators sometimes rush to apply new techniques to appear up-to-date. Fisher,39 the pioneer statistician and author of a classic work on experimental design, warned, “any brilliant achievement . . . may give prestige to the method employed, or to some part of it, even in applications to which it has no special appropriateness.” An exception to the requirement of “newness” for a publication is the need to report confirmations of previous work. One journal for which I refereed placed the category “valuable confirmation of previous work” in third place in their ranking system below exciting original research and interesting new findings, but above categories related to rejection. This type of research is taking on increasing importance in the light of the “reproducibility crisis” to be discussed later.

Is it true?

Sound conclusions are the result of reliable observations combined with valid logic. Knowledge of measurement, types of observational errors, experimental design, and controls gives some basis for assessing the reliability of observations. Thus, sections of this book deal with these topics and with the logic used to interpret observations. But the ultimate test of any scientific observation is reproducibility; indeed, a practical definition of truth for pragmatic working scientists is that a scientific statement is true if it allows us to make useful, reliable predictions that are confirmed when tested by a competent scientist under the specified conditions. There are theoretical or practical limitations to any approach: Newton's laws of motion are perfectly valid when applied to billiard balls colliding on a pool table but are not useful at the very small scales of subatomic particles, where quantum physics is preferred. Note that confirmation does not imply the exact same numeric result, but rather one that is within the specified interval of reported uncertainty. A clue to the reproducibility of an observation is the consistency of the results within the body of the paper. Another means of evaluating the reliability of observations in a paper is to read what other scientists have written about the work, and citation analysis is an efficient means of uncovering that information. For various reasons, to be discussed later, there is a perceived "reproducibility crisis" in which confidence in the reproducibility of findings, even those published in high-impact journals, is waning.

A student might wonder whether it is necessary to learn such diverse concepts and examine the literature in such detail, particularly when it seems likely that the vast majority of publications are produced in good faith and come from institutions of higher learning. Ioannidis,40 however, has argued that most published research findings are false. Ioannidis' estimate is sensitive to the pretest probability of the hypothesis being true, and a low estimate of this value will lead to a higher proportion of papers' conclusions being classed as false. Nevertheless, as will be discussed later, current research into the reproducibility of findings has provided more direct evidence that a significant proportion of findings are false in the sense that they cannot be reproduced. In Ioannidis' common-sense view, a research finding is less likely to be true when effect sizes are small, when there are a large number of tested hypotheses that have not been preselected, and when there are great flexibilities in designs, definitions, outcomes, and data analyses. Other problems affecting the truth of a conclusion are financial and other conflicts of interest and prejudices, as well as the number of teams in a field chasing statistical significance. I believe it is unlikely that most research findings are false, because if they were there would be more papers reporting failure to confirm results (though admittedly publishing such negative results can be difficult) and many fewer papers confirming—albeit often indirectly—previous findings. Nevertheless, the considerations listed by Ioannidis serve to warn readers of the dental and medical literature that there is no shortage of well-documented threats to truth.
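Ioannidis' argument is, at bottom, an application of Bayes' theorem, and its sensitivity to the pretest probability is easy to verify. The following minimal sketch (in Python, with conventional illustrative values for significance level and power rather than figures from any particular field) computes the probability that a statistically significant finding is actually true:

```python
# Sketch of the Bayesian arithmetic behind Ioannidis's argument.
# alpha and power are conventional illustrative values; the pretest
# probabilities are assumptions chosen to show the trend, not data.

def prob_finding_true(pretest_prob, alpha=0.05, power=0.80):
    """Probability that a 'significant' finding is true (Bayes' theorem)."""
    true_positives = power * pretest_prob
    false_positives = alpha * (1 - pretest_prob)
    return true_positives / (true_positives + false_positives)

for p in (0.50, 0.10, 0.01):
    print(f"pretest probability {p:.2f} -> "
          f"probability the finding is true {prob_finding_true(p):.2f}")
# pretest probability 0.50 -> probability the finding is true 0.94
# pretest probability 0.10 -> probability the finding is true 0.64
# pretest probability 0.01 -> probability the finding is true 0.14
```

With well-grounded hypotheses (pretest probability near 0.5), most significant findings are true; for long-shot hypotheses, most are false. This is why any estimate of the proportion of false findings hinges on the value assumed for the pretest probability.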

Is it important?

The importance of a paper cannot be tested in a completely objective manner. Maxwell41 has argued—in my opinion, persuasively—that real progress in science is assessed in terms of the amount of valuable factual truth being discovered and that the accumulation of vast amounts of trivia (even if factually correct) does not amount to progress. The problem is that value judgments are highly subjective. One approach to measuring the impact of a paper is the number of citations to the paper, an aspect that will be discussed in chapters 5 and 22. Many scientists have accepted this criterion and include the citation record of their papers in their curriculum vitae or include indices derived from their citation record, such as the h-index (discussed in chapter 5).

One can speculate about what qualities an ideal evaluator should have. Beveridge42 has suggested the concept of scientific taste, which he described as a sense of beauty or esthetic sensibility. Beveridge explained scientific taste by stating that:

The person who possesses the flair for choosing profitable lines of investigation is able to see further where the work is leading than are other people because he has the habit of using his imagination to look far ahead, instead of restricting his thinking to established knowledge and the immediate problem.

A person with scientific taste would be a good judge of the importance of a scientific paper. Traditionally, the skill of judgment is learned in the apprentice-master relationship formed between graduate student and supervisor. Techniques may come and go, but judging what is important and how it can be innovatively studied is the core business of scientists, and these skills are learned much as a child learns prayers at a mother's knee: Graduate students hone their critical skills in the supervisor's office or in lab meetings. Thus, much importance is attached to the pedigree of a scientist, and some scientists take pride in tracing their scientific pedigrees to leading figures in a field of study.

Given the large variation in laboratory and supervisor quality, there will always be significant differences in judgment. This diversity is evident in an extensive study of proposals submitted to the National Science Foundation. The study found that getting a research grant depends significantly on chance, because there is substantial disagreement among eligible reviewers, and the success of a proposal rests on which reviewers happen to be selected.43 Moreover, there is evidence that complete disagreement between pairs of referees assessing the same paper is common.43 In biomedical science, the frequency of agreement between referees was found to be not much better than would be expected by chance.44 Hence, it appears that objective and absolute criteria for the evaluation of a paper prior to publication are not available. Chapter 22 attempts to cultivate the skill of judgment by providing information on recognized sources of error in judgment as well as on citation analysis, a technique that can be used to access broadly based scientific assessments of published works.
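The claim that referee agreement is barely better than chance can be made precise with a chance-corrected agreement statistic such as Cohen's kappa. Here is a minimal sketch, using invented counts (not data from the cited studies) for 100 manuscripts each rated accept or reject by two referees:

```python
# Cohen's kappa: agreement between two referees, corrected for chance.
# The counts are hypothetical and chosen only to illustrate the idea.
both_accept, both_reject = 30, 30   # manuscripts on which referees agree
a_only, b_only = 20, 20             # manuscripts on which they disagree
n = both_accept + both_reject + a_only + b_only

observed = (both_accept + both_reject) / n     # raw agreement
a_accept = (both_accept + a_only) / n          # referee A's accept rate
b_accept = (both_accept + b_only) / n          # referee B's accept rate
# agreement expected if the two referees rated independently
expected = a_accept * b_accept + (1 - a_accept) * (1 - b_accept)

kappa = (observed - expected) / (1 - expected)
print(f"observed = {observed:.2f}, expected by chance = {expected:.2f}, "
      f"kappa = {kappa:.2f}")
# observed = 0.60, expected by chance = 0.50, kappa = 0.20
```

A kappa of 0 means agreement no better than chance. In this hypothetical case the referees agree on 60% of manuscripts, but because chance alone would produce 50% agreement, the chance-corrected agreement is a modest 0.20.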

References

1. Chargaff E. Triviality in science: A brief meditation on fashions. Perspect Biol Med 1976;19:324.
2. Norris SP. Synthesis of research on critical thinking. Educ Leadersh 1985;42:40–45.
3. Kurland DJ. I Know What It Says—What Does It Mean? Critical Skills for Critical Reading. Belmont, CA: Wadsworth, 1995:164.
4. Butler CB. New ways of seeing the world. In: Butler CB (ed). Postmodernism: A Very Short Introduction. New York: Oxford University, 2002:37–42.
5. Bronowski J. The common sense of science. In: Bronowski J (ed). The Common Sense of Science. New York: Vintage, 1967:97–118.
6. Chambers DW. Lessons from badly behaved technology transfer to a professional context. Int J Technol Transfer Commercialisation 2005;4:63.
7. Chambers DW. Changing dental disease patterns. Contact Point 1985;63:1–17.
8. Fish W. Framing and testing hypotheses. In: Cohen B, Kramer IRH (eds). Scientific Foundations of Dentistry. London: William Heinemann, 1976:669.
9. Perkins DN, Allen R, Hafner J. Difficulties in everyday reasoning. In: Maxwell W (ed). Thinking: The Expanding Frontier. Philadelphia: Franklin Institute, 1983:177.
10. Zipf GK. Human Behaviour and the Principle of Least Effort. Cambridge: Addison-Wesley, 1949.
11. Pratkanis AR, Aronson E. Age of Propaganda: The Everyday Use and Abuse of Persuasion. New York: WH Freeman, 2001:38.
12. Kahneman D. Attention and effort. In: Kahneman D (ed). Thinking, Fast and Slow. Toronto: Anchor Canada, 2013:31–38.
13. Risen J, Gilovich T. Informal logical fallacies. In: Sternberg RJ, Roediger HL, Halpern DF (eds). Critical Thinking in Psychology. Cambridge: Cambridge University, 2007:110–130.
14. Garfield E. Current comments: Is the ratio between number of citations and publications cited a true constant? Curr Contents 1976;6:5–7.
15. Glenny A, Hooper L. Why are systematic reviews useful? In: Clarkson J, Harrison JE, Ismail A (eds). Evidence Based Dentistry for Effective Practice. London: Martin Dunitz, 2003:59.
16. Garfield E. The literature of dental science vs the literature used by dental researchers. In: Garfield E (ed). Essays of an Information Scientist. Philadelphia: ISI, 1982:373.
17. Relman AS. Journals. In: Warren KS (ed). Coping with the Biomedical Literature. New York: Praeger, 1981:67.
18. Worthington H, Clarkson J. Systematic reviews in dentistry: The role of the Cochrane Oral Health Group. In: Clarkson J, Harrison JE, Ismail A (eds). Evidence Based Dentistry for Effective Practice. London: Martin Dunitz, 2003:97.
19. Chambers DW. Habits of the reflective practitioner. Contact Point 1999;79:8–10.
20. Day RA. How to Write and Publish a Scientific Paper. Philadelphia: ISI, 1979:2.
21. DeBakey L. The Scientific Journal: Editorial Policies and Practices: Guidelines for Editors, Reviewers, and Authors. St Louis: Mosby, 1976:1–3.
22. Simonton DK. Creativity in Science: Chance, Logic, Genius, and Zeitgeist. Cambridge: Cambridge University, 2004:85–86.
23. Goodman SN, Berlin J, Fletcher SW, Fletcher RH. Manuscript quality before and after peer review and editing at Annals of Internal Medicine. Ann Intern Med 1994;121(1):11.
24. Giannobile WV. Editor's Report for the Journal of Dental Research–2015. http://dentalresearchblog.org/jdr/?p=338. Accessed 18 July 2018.
25. Tacker MM. Parts of the research report: The title. Int J Prosthodont 1990;3:396–397.
26. Chernin E. Producing biomedical information: First, do no harm. In: Warren KS (ed). Coping with the Biomedical Literature. New York: Praeger, 1981:40–65.
27. Day RA. How to Write and Publish a Scientific Paper. Philadelphia: ISI, 1979:72.
28. Cialdini RB. The science of persuasion. Sci Am 2001;284:76–81.
29. Gross AG. The Rhetoric of Science. Cambridge: Harvard University, 1990.
30. Gopen G, Swan J. The science of scientific writing. Am Sci 1990;78:550–558.
31. Strunk W, White EB. The Elements of Style, ed 4. Boston: Allyn and Bacon, 1999.
32. Zinsser W. On Writing Well, ed 2. New York: Harper and Row, 1980.
33. Walton D. Argument as reasoned dialogue. In: Walton D (ed). Informal Logic: A Pragmatic Approach, ed 2. New York: Cambridge University, 2008:1–34.
34. Toulmin SE. The Uses of Argument, updated ed. Cambridge: Cambridge University, 2003.
35. Lang TA. How to Write, Publish, and Present in the Health Sciences: A Guide for Clinicians and Laboratory Researchers. Philadelphia: American College of Physicians, 2010:41.
36. Ziman J. Reliable Knowledge: An Exploration of the Grounds for Belief in Science. Cambridge: Cambridge University, 1978:7.
37. DeBakey L. The Scientific Journal: Editorial Policies and Practices: Guidelines for Editors, Reviewers, and Authors. St Louis: Mosby, 1976:1.
38. Kuhn TS. The Structure of Scientific Revolutions, ed 2. Chicago: Chicago University, 1970:184.
39. Fisher RA. The Design of Experiments, ed 8. Edinburgh: Oliver & Boyd, 1953:184.
40. Ioannidis JP. Why most published research findings are false. PLoS Med 2005;2:e124.
41. Maxwell N. Articulating the aims of science. Nature 1977;265:2.
42. Beveridge WI. The Art of Scientific Investigation. New York: Vintage, 1950:106.
43. Cole S, Cole JR, Simon GA. Chance and consensus in peer review. Science 1981;214:881.
44. Gordon M. Evaluating the evaluators. New Sci 1977;73:342.


2 Scientific Method and the Behavior of Scientists

Because there is no one scientific method, any account of scientific method is bound to be incomplete or even inaccurate and misleading. Sir Peter Medawar, a Nobel laureate, has stated that "there is no such thing as a calculus of discovery or a schedule of rules which by following we are conducted to a truth."1 Discoveries are made by individual scientists, who often have their own original and distinctive ways of thinking, and they do not necessarily follow any rigid protocol of scientific method. The simple view advanced by Bronowski2 is that scientific method is organized common sense, and indeed this concept is emphasized in this book. However, the inclusion of the term organized is not a small addition, for scientific method differs from common sense in the rigor with which matters are investigated. For example, precise operational definitions, procedures to quantify, and theories to explain relationships are often employed, and a great effort is made to avoid inconsistencies. Results should be subject to the systematic scrutiny of the investigator or other scientists, and limits on how far the results can be applied should be sought. Formal methods for describing scientific method are still the topic of philosophic examination. But philosophic speculation or practice does not greatly concern typical research scientists, who are largely occupied in puzzle solving3 and who often seem too busy to consider how they got from A to B or how their investigational strategies relate to any philosophic concepts. There is increasing recognition that a key aspect in the development and acceptance of scientific facts and theories is the social interaction among scientists. Descriptions of an ideal exist for both the scientific method and the behavior of scientists, but the ideal does not always correspond with reality. Nevertheless, as they are the norms, they will be discussed here.



“Thou hast made him a little less than angels.” HEBREWS 2:7


The Behavior of Scientists

The everyday life of a scientist

In his chapter on discovery, Fred Grinnell gives a good account of how a working scientist operates.4 In brief, Grinnell started out on a project to study citric acid cycle enzymes under conditions of altered energy metabolism in rat liver cells. Such studies typically involve inhibitors, and Grinnell found that the addition of one inhibitor, arsenite, to the incubation medium had a surprising result: the cells to be cultured did not stick to the dish. This unexpected finding offered a possible means of investigating a larger problem of interest, namely the mechanisms of cell adhesion. After consulting with senior colleagues, Grinnell became convinced that cell adhesion was an important issue and proceeded to look at the problem in more detail. The chapter documents Grinnell's initial studies by including pages from his lab book and photomicrographs. Initially the reader might not be impressed by the quality of the micrographs or by the messiness of the lab books, which do not at all look like the polished pages of a published paper. But as the project proceeded, the quality of the photomicrographs improved, enabling Grinnell to notice and quantify changes in cell shape, in particular that treatment of cell culture surfaces with serum enabled the cells not just to attach as rounded cells but to spread out. So now Grinnell had a system in which cell adhesion could be altered by a known treatment, and the various possible mediators of the spreading effect of serum could be dissected out and tested. Eventually Grinnell contributed to the discovery and elucidation of the function of the biologic adhesion protein fibronectin and helped to establish the importance of fibronectin in wound repair.

In these early experiments, though, we can see some of the essential issues and processes of the working scientist. First, there was a concentration of interest on an important problem. Second, there were unexpected novel findings that the investigator realized had potential for further investigation. Third, there was a refinement of technique so that the biologic processes could be measured and dissected. Fourth, there were decisions that had to be made, such as discontinuing the original line of investigation when a more interesting aspect emerged. Fifth, there were no rules or grand plan that were slavishly followed but rather an interactive approach between what was found and what was best done next to solve a problem. Inherent in the description, as well as in other parts of Grinnell's book, is the idea that solving scientific puzzles is fun and that, indeed, researchers feel their job could best be described as "how to get paid for having fun." Hold the attractive thought that on a day-to-day basis a researcher is having fun as we consider some of the other issues in scientists' behavior that are uncommon and sometimes less than pleasant.

Aspects of the sociology of science

The pioneer sociologist of science, Merton,5 identified six guiding principles of behavior for scientists:

1. Universalism refers to the internationality and independence of scientific findings. There are no privileged sources of scientific knowledge6; scientific results should be analyzed objectively and should be verifiable and repeatable. In practice, this norm means that all statements are backed up by data or by citations to published work. Internationalism is one of the characteristics of modern science that emphasizes collaboration; papers frequently have multiple authors from different institutions and countries.

2. Organized skepticism describes the interactions whereby scientists evaluate findings before accepting them. Ideally, scientists would check results by repeating the observations or experiments, but this approach is time consuming and expensive. At the very least, scientists try to determine whether reported results are consistent with other publications. An ironclad rule of science is that when you publish something, you are responsible for it. When a finding is challenged, the investigator must take the criticism seriously and consider it carefully, regardless of whether the investigator is a senior professor and the challenger the lowliest technician or graduate student.7

3. Communalism is the norm that enjoins scientists to share the results of their research. "Scientific knowledge is public knowledge, freely available to all."6 One factor acting against the free exchange of information in a timely manner is the growing commercialization of scientific research. Because both the institution and the principal investigator may benefit financially by obtaining rights to intellectual property, the time required to obtain patents results in delays in transmitting findings to the scientific community.

4. Disinterestedness is summed up by the dictum "Science for science's sake." At present, this norm appears to be honored more in the breach than in the observance. As noted previously, many scientists patent their discoveries and form alliances with commercial interests. At one time, I served on a grants committee that dispersed funds for major equipment. It happened that two applications from the same institution requested similar equipment, although there was not enough work to justify this duplication. One panel member wondered why the two groups did not combine and submit a single application. As it turned out, the two university-based principal investigators collaborated with different commercial interests and wanted a barrier between their laboratories. Ideally, scientists should not have any psychologic or financial stake in the acceptance or rejection of their theories and findings.

5. Humility is derived from the precept that the whole of the scientific edifice is incomparably greater than any of its individual parts. This norm is in operation whenever scientists receive awards; they inevitably thank their coworkers in glowing terms. Scientists giving invited lectures in symposia are generally at pains to point out which graduate students or postdoctoral researchers actually did the work and include lab pictures in their presentations to share the glory (small though it may be). Despite the norm of humility, clashes of egos still occur. Several rules of behavior govern the interaction of scientists in the case of a disagreement. Discussion should be detached; that is, the issues, and not the personalities, should be discussed. The question is not who is right but rather what is right. The debate should be constructive; for example, if a referee decides that a paper should be rejected, the referee's comments should indicate how the paper could be improved. Finally, scientists who disagree should be courteous; they can disagree without being disagreeable.

6. Originality is highly prized in science and features prominently in determining who wins awards and grants. Yet originality is difficult to define precisely, and, as Merton8 noted, there is a "gap between the enormous emphasis placed upon original discovery and the great difficulty a good many scientists experience in making one." The originality of a scientific work can reside in the novelty of the hypothesis being investigated, in the methods used to investigate it, and in the results obtained. Perhaps the most common scientific strategy—the transfer method—involves applying methods and concepts from one field to the particular topics of an adjacent field. For example, one could test the effect of a drug in a rat macrophage cell line that already had been investigated in a mouse macrophage line. On such minor contributions has many a career been built.

Recognition is the "coin of science"

In the traditional description, scientists are portrayed as altruistic individuals, devoid of personal or selfish considerations and engaged in the objective search for truth. Scientists often adopt—or at least pay lip service to—this view. The American Scientist,9 for example, published 75 case histories on "Why I became a scientist." Aside from two individuals, these scientists seemed to attach little importance to a good salary, an issue that greatly concerns many other professionals. Yet anyone working with academics knows that this mundane matter is often hotly disputed. In an anthropologic investigation into laboratory life, Latour and Woolgar10 found that scientific activity is geared toward the publication of papers and, moreover, that personal motivations and interactions are prime factors in determining what gets done. High on the list of motivators is recognition. According to Cronin,11 recognition is the exchange on which the social system of science hinges. Investigators insist that their work be cited where appropriate and dispute priority claims vigorously. A prominent example is the dispute between Robert Gallo and Luc Montagnier over priority in the discovery of HIV as the cause of AIDS. That conflict ended with the protagonists jointly writing a history of the discovery. Such disputes would be unlikely to occur between humble men. Thus, there seems to be considerable discrepancy between the ideals of scientific behavior and the way scientists actually behave.

High-impact research, collaboration, and the "shadow of the future"

Simonton12 has reviewed the characteristics of highly creative scientists doing high-impact research. Great scientists tend to possess the ability to ask critical questions about a variety of topics. They do not work on just one project at a time but rather involve themselves with a number of independent projects simultaneously while employing a core set of themes, issues, perspectives, or metaphors. These projects may differ in their feasibility, the intrinsic importance of the questions, interaction with other projects, specific type of research, progress, and the amount of effort demanded of the investigator.

Increasingly, modern research scientists find collaboration worthwhile. Several studies have shown that the most prolific scientists tend to collaborate the most (briefly reviewed in Surowiecki13). Nobel laureates, for example, collaborate more frequently than their run-of-the-mill scientific colleagues. Collaboration extends the range of topics that a scientist can investigate efficiently, because it provides a source of knowledge and technical competence not found in the investigator's laboratory. Moreover, in a collaborative project, work is almost automatically assigned to the laboratory where it can be accomplished most efficiently. Collaboration does come at the price of adding names to papers, which makes it seem that a scientist might lose some recognition by having to share it with others. I think, however, that for the bodies that dispense much of the recognition for research—namely, promotion, tenure, and grants committees—it is the number of papers that counts; the number of authors on a paper is most often not rigorously taken into account. That is, publishing four papers with three coauthors would be perceived more positively than publishing a single paper as a sole author. Collaboration seems to be most effective when done locally. Academic researchers were found in one study to spend only a third of their time with people not in their immediate work group; nevertheless, a quarter of their time was spent working with people outside their university.13

The political scientist Robert Axelrod14 postulated that cooperation is the result of repeated interaction with the same people. Trust is not really required to form a collaborative partnership; the key to collaboration is "the shadow of the future." People in cooperative relationships should start off being nice, but they have to be willing to punish noncooperative behavior, with an approach of being "nice, forgiving, and retaliatory." I have seen this principle in operation in grants committees and with manuscripts submitted for publication, in which there does seem to be an element of "you score my grant (or paper) highly, and I will score yours highly." Thus, old boys/girls networks are formed. Even professor-student relationships can be clouded by the shadow of the future. I know of one professor who distributed his course evaluation forms to his class with the trenchant proposition, "you give me a nice evaluation, I will give you a nice exam." It would take a brave student to ignore the dark cloud of a tough exam in the future—much safer to check all the boxes rating the forthright professor as excellent and hope that fellow students exhibit the same enlightened self-interest.

A case study of the effectiveness of retaliation is provided by Sir Cyril Burt, an influential British psychologist who was accused of falsifying data on the nature/nurture question of intelligence. Burt relished controversy and, according to Hearnshaw,15 "never missed a chance to give back more than he got." Burt was not scrupulous in representing his opponents' arguments fairly and resorted to such devices as obfuscation, misrepresentation (eg, using fictitious names, he wrote and published letters to the editor in the journal that he edited himself), and making claims that were hard to verify.16 As a high-handed editor of a prominent British journal, Burt was someone you crossed at your peril. For a young scientist developing a career, cooperating with Burt was a much more attractive option than contending with him.

Objectivity

Objectivity remains the cornerstone of scientific method. A scientist should be a disinterested observer interpreting data without personal bias or prejudice. Ideally, scientific observations should be objective, meaning that the same observation would be made by any qualified individual examining the same phenomenon. Conditions of observation should be arranged so that the observer's bias does not distort the observations. The personality of the scientist should be irrelevant to the observations. The mantra of practitioners of the new brand of scientific objectivity that emerged at the end of the 19th century was "let nature speak for itself."17 One indication of the attempt at objectivity is the pervasive use of the passive voice in scientific writing, as this seems to impart some distance between the observer and the observation. (I should note that this practice is changing in scientific writing; it is fair to say that current opinion favors the use of the active voice.) In any case, merely writing in the passive voice does not guarantee objectivity. It has been said that 17th-century epistemology aspired to the viewpoint of angels, and 19th-century objectivity to the self-discipline of saints.17 These lofty goals are hard to attain for 21st-century mortals; increasingly, the objectivity of scientists has been called into question.

Aberrant behavior: Lack of objectivity (Should the history of science be rated "X"?)

This attack on the perceived objectivity of scientists has taken place on two fronts and involves two types of scientific soldiers: the generals of scientific history and the more humble common cannon fodder who fight for jobs, grants, and tenure. Consider the following generals:

• Claudius Ptolemy, generally regarded as the greatest astronomer of antiquity, appears to have altered his observations to fit his theories. Moreover, his writings appear to be a "cover up." One historian of science, Robert Newton, has dubbed Ptolemy "the most successful fraud in the history of science."18
• There is good reason to think that Galileo was dishonest about some of his observations.19
• Isaac Newton manufactured accuracy by taking rough measurements and producing calculations that claimed accuracy to six or seven decimal places. Moreover, Newton selected data to fit his hypothesis. Newton's correspondence with his publisher leaves little doubt about what he was up to; he adjusted values, one after another, so that laws would appear to fit observational data. Newton termed these actions "mending the numbers." "No one," concluded Westfall20 in his article "Newton and the Fudge Factor," "can manipulate the fudge factor so effectively as the master mathematician."
• Einstein wrote, ". . . it is quite wrong to try founding a theory on observable magnitudes alone. In reality the very opposite happens. It is the theory which decides what we can observe."19

In all these instances, it appears that great scientists were subjective or at least considered observations to be secondary to theory. Kuhn21 gives numerous examples of revolutionary scientists rising above the morass of observational data to construct their elegant theories, which at the time of their inception were no better supported than those of their predecessors. This heresy is not new; over half a century ago, Bondi22 suggested that, in astronomy, observations had proven a less reliable arbiter of hypotheses than had theoretical considerations. As there is little danger that an intellect like Einstein's will be found in dental research, you may be wondering what these stories have to do with this book. They simply show that the traditional explanation of scientific method does not provide an adequate description of how some outstanding scientists actually operate. Thus, it should come as no surprise that some of the lesser lights of science can also lack objectivity. This is particularly true when you consider that the rewards of science—grants, jobs, tenure, and even fame—go to the people who publish.

The pressure to publish has been invoked as one of the reasons for the increasing problem of scientific fraud. But dubious scientific practices have attracted critical comment for some time. Charles Babbage, famous for inventing the first mechanical computer, produced a typology of scientific fraud in 1830.23 Hoaxing, forging, trimming, and cooking formed the unholy quadrivium of Babbage's typology. The sins of forging and hoaxing are obvious enough, and Babbage viewed these as less serious problems for science, as forging was rare and hoaxes would eventually be discovered. Trimming consisted of making the reported precision of data appear better than it was by clipping off bits of data from high readings and adding them to low readings so that the reported average was not affected. Cooking comprised a range of practices that resulted in data agreeing with expectations. The term has considerable staying power; it was a common practice in my undergraduate days in physics labs to employ "Cook's constant" to coax highly accurate values for physical properties (such as the gravitational constant G) from antiquated and poorly maintained equipment.

Scientific fraud continues to be problematic today. Friedman24 compiled a list of contemporary high-powered research that is thought likely to be scientific fraud. Diverse disciplines, including surgery, physics, malaria research, immunology, anthropology, chemistry, biology, and genetics, have been thought to be victimized by fraudsters. Readers might think that oral health research, conducted as it is by enlightened individuals, would be exempt, but it too has a celebrated example of scientific misconduct. In 2006, Jon Sudbø, DDS, associate professor at the University of Oslo, published an article in The Lancet that was based on the study of some 900 patients whose records he had entirely fabricated. It was evident that something was amiss in that, of the 908 people in the study, some 250 shared the same birth date. Improbable numeric oddities have led to suspicions of scientific misconduct in other cases, such as Sir Cyril Burt's study on monozygotic and dizygotic twins, where correlation coefficients were the same to three decimal places across various articles even though new data had been added to the sample. The Sudbø case has been described as "the biggest scientific fraud conducted by a single researcher ever,"25 and subsequent investigations revealed that Sudbø used fabricated data and/or falsified research in applications to the National Institutes of Health (NIH). Sudbø also admitted to falsifying and/or fabricating data in three publications. After some disciplinary actions, including revocation of his doctorate in medicine the same year his article was published in The Lancet, Sudbø regained his license to practice medicine and dentistry with some restrictions. Dr Richard Horton, editor of The Lancet, called the fabrication of the data a "terrible personal tragedy" for Sudbø.26 Other examples of high-profile contemporary fraud are given in appendix 3 of Friedman.24

Traditionalists argue that frauds eventually will be found out, because other researchers will be unable to repeat the fraudulent results. Nevertheless, scientific fraud has become an important issue. In Betrayers of the Truth, Broad and Wade27 dissect some of the more widely known cases of scientific fraud, many of which concerned individuals in world-renowned institutions. In the instances reviewed by Broad and Wade, it was generally a coworker who blew the whistle; attempted replication of results by researchers in other laboratories played a minor role, if any, in the detection of fraud. Perhaps this is to be expected, because experimental systems often vary in detail from one investigator to another. If one investigator fails to confirm another's results, the variation can be attributed to differences in the experimental systems. To set up exact replications would be tedious, and there is little reward attached to confirming another's results. Thus, it is easier and more profitable for scientists to forge ahead with their own work than to confirm the work of others, and much data probably go unchecked. Similarly, Kohn28 in False Prophets and Judson7 in The Great Betrayal examine instances of scientific fraud or suspected fraud in detail.

Although there are abundant anecdotal reports of "bad apples," Mosteller,29 a professor of statistics at Harvard and a former president of the American Association for the Advancement of Science, pointed out, perhaps with the statistician's penchant for using Occam's razor, that: "Before appealing to fraud, it is well to keep in mind the old saying that most institutions have enough incompetence to explain almost anything."

Just how prevalent and significant the problem may be is difficult to determine. An early informal survey on scientific cheating concluded that intentional bias (a euphemism for cheating) may be widespread.30 Objective data on such activities are obviously difficult to obtain because potential informants are reluctant to admit their failings. Nevertheless, a more recent survey,31 in which scientists were randomly selected from databases of the NIH Office of Extramural Research, incorporated mechanisms for confidentiality and obtained over 3,000 replies (a roughly 46% response rate). Misbehavior was broadly interpreted, ranging from serious offenses, such as falsifying or cooking data, to less serious ones, such as inadequate record keeping related to research projects. As many as one in three of the survey respondents admitted to committing at least one of the 10 most serious behaviors. The authors concluded that certain features of the research environment, such as perceived inequities in how resources are distributed, may have detrimental effects on the ethical dimensions of scientists' work. Thus, the approach toward scientific fraud may be changing from blaming aberrant "bad apples" to determining how the environment shapes scientists' behavior. The British Medical Journal devoted a special section to fraud in research in its issue of 6 June 1998 and commissioned five answers to the question of how best to respond to research misconduct. The NIH has established an Office of Research Integrity that, among other duties, sponsors research on the topic. It might be thought that cheating does not really affect scientific progress because in the long run errors will be corrected.3 In my view, there can be no doubt that scientific fraud and related scientific misdemeanors do impair the efficiency of science, that is, the amount of useful information obtained per dollar invested. In my experience, granting agencies do a poor job of assessing how effectively their research funds are spent. Moreover, in universities and research institutions, credit is disproportionately given to those obtaining research money (which benefits the institution because of accompanying money to cover indirect costs) rather than to those producing new information. Research funding can be readily and speedily measured; research impact may require more than a decade to determine. This allocation of recognition and rewards to the most grant-productive may lead investigators to reason that the ends justify the means, or, to quote Doty's32 assessment of scientific misconduct, "the validity of an action is decided by whether one can get away with it." The chances of getting away with the crime of obtaining research funds based on misleading data and unlikely promises are pretty good; granting panels or committees seldom reconvene to determine how good a job they did in funding significant research.
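The kind of numeric oddity that betrayed the Sudbø fabrication can be checked with very little effort. Here is a minimal simulation sketch, assuming (simplistically) that genuine birth dates are spread uniformly over 365 days:

```python
# Back-of-the-envelope check: among 908 patients, how many could
# plausibly share the single most common birth date by chance alone?
# Assumes uniform birth dates, which is close enough for this purpose.
import random

random.seed(1)
trials, n_patients, n_days = 1_000, 908, 365
max_shared = []
for _ in range(trials):
    counts = [0] * n_days
    for _ in range(n_patients):
        counts[random.randrange(n_days)] += 1
    max_shared.append(max(counts))

print(f"average patients per date: {n_patients / n_days:.1f}")
print(f"largest 'most common date' seen in {trials} trials: {max(max_shared)}")
# The most common date is typically shared by fewer than a dozen
# patients; 250 sharing one date is a signature of fabricated records,
# not of chance.
```

Checks of this sort cost minutes, which is part of the reason improbable numeric patterns, such as Burt's correlation coefficients agreeing to three decimal places, are so often the first thread that unravels a fraud.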

Unintentional bias

Perhaps more crippling to the cause of truth is unintentional bias, which occurs when an investigator unknowingly designs a study so that the information obtained is in some way biased. Unintentional bias is widespread and happens in social science, for example, when one group in the population is not considered. Piaget's classic studies on child development were based only on male children, and the work of renowned anthropologist Margaret Mead has been widely criticized for drawing conclusions about whole societies largely on the basis of interviews with teenaged girls. An extreme form of unintentional bias is self-delusion, in which scientists believe they are practicing appropriate scientific method but have lost objectivity. Dennis Rousseau33 has proposed that errors in science created by a loss of objectivity consistently demonstrate a similar pattern. This "pathologic science" has three characteristics:

1. The effect being studied is often at the limits of detectability, and increasing the strength of the causative agent may not increase the size of the effect.
2. The scientists are ready to dispose summarily of prevailing theories and to substitute revolutionary new ones.
3. The scientists avoid doing experiments that are critical tests of their theories or observations.

Two examples of pathologic science given by Rousseau33 are the "cold fusion" process of generating energy and the "infinite dilution" process, whereby a biologically active solution is diluted so many times that no active molecules can be present, yet the solution continues to have an effect. A study on infinite dilution was published in Nature34 and gave, in some people's eyes, some credence to the practice of homeopathy. It is noteworthy that in the instances of both cold fusion and infinite dilution, the traditional description of scientific method was operative; the aberrant results were checked by other investigators and found to be erroneous.

Objectivity can also be compromised by knowledge. It has often been said that what you see depends on how you look, but evidence suggests that what you see depends on what you know. For example, pathologists' judgments can be dramatically influenced by prior knowledge of the clinical features of a case. Hetherington35 has reviewed several examples from astronomy in which the observations reflected personal biases rather than reality. In Hetherington's view, the fact that errors in observation are correlated with the expectations of observers has been demonstrated beyond reasonable doubt, and "the warping of judgment by knowledge, the influence on observational reports by preconceived opinion, is inevitable." The relationship of scientists to prevailing theories may be likened to that of fish to water; fish appear to be blissfully unaware of water even though the water must necessarily affect all their sensory inputs. Hetherington concludes, "the decline of theology followed from historical studies revealing that its supposedly divine sources were not absolute, but historically relative, subject to cultural forces. It remains to be seen whether historical studies of science may contribute to a strengthening of science's better features or to a weakening of confidence in modern science."35

The conclusion of this brief consideration of the behavior of scientists is that the reader must approach the literature with a critical and skeptical eye. For example, metronidazole, a chemotherapeutic agent against trichomonas infection, has proved useful in treating vaginitis and has been proposed for use in the treatment of periodontal disease. Two experienced cancer researchers tested the drug for carcinogenicity (Fig 2-1). They concluded, among other things, that the drug produced a significant incidence of pituitary and testicular neoplasms in male rats. However, the discussion paid little attention to the result that rats taking metronidazole lived longer. As in elderly people, cancer is more prevalent in elderly rats, and it is not surprising that there were more tumors in the metronidazole-treated group, because these rats lived long enough to develop tumors. It appears that the authors were so intent on looking for carcinogenic effects that they ignored the observation that they had discovered the "fountain of youth" for rats.36

Fig 2-1 Survival versus age in controls and male rats receiving metronidazole. (Adapted from Rustia and Shubik36 with permission.) [Figure not reproduced; it plots survival (%) against age in weeks for control and 0.06% metronidazole groups.]
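The trap the authors fell into can be illustrated with a few lines of code. In the following minimal sketch the weekly tumor risk is, by construction, identical in the two groups; only the lifespans differ. All numbers are invented for illustration and are not taken from the study:

```python
# Confounding by survival time: identical per-week tumor risk in both
# groups still yields more tumors in the longer-lived group.
weekly_tumor_risk = 0.005        # assumed equal in both groups, per week alive

def lifetime_tumor_incidence(lifespan_weeks):
    # probability of at least one tumor before death
    return 1 - (1 - weekly_tumor_risk) ** lifespan_weeks

for group, lifespan in (("control", 80), ("metronidazole", 110)):
    print(f"{group:>13}: lived {lifespan} wk, "
          f"tumor incidence = {lifetime_tumor_incidence(lifespan):.0%}")
# control:       lived 80 wk,  tumor incidence = 33%
# metronidazole: lived 110 wk, tumor incidence = 42%
```

A naive comparison of tumor counts at death would declare the treated group at higher risk, when the appropriate comparison is risk per unit of time alive, for instance by age-adjusting the incidence.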

The Storybook Version of Scientific Method

A traditional description of scientific method, sometimes called the storybook scientific method, incorporates several sequential steps:

1. Investigators must first decide what questions can be asked and whether the questions are worth asking.
2. They gather and organize the information pertaining to the problem.
3. They form a working hypothesis that represents the most likely answer to the question.
4. They make observations that test the hypothesis. This stage involves choices about strategies and techniques.
5. If the results contradict the predictions of the hypothesis, the hypothesis is modified.
6. If the results agree with the hypothesis, the investigators devise other tests of the hypothesis in an attempt to prove themselves wrong.
7. If these subsequent tests support the hypothesis, the investigators accept the hypothesis and publish the results, which other scientists then attempt to replicate.
8. Successful replication as well as other tests of the hypothesis eventually lead to a consensus whereby another piece of information is added to the body of science.

One problem with the storybook scientific method is that it contradicts what we know about how humans actually behave. People are reluctant to change their beliefs, even in the face of empirical evidence that proves them wrong (see chapter 22). The storybook scientific method gives scientists the duty to prove themselves wrong, contrary to people's natural inclination to prove themselves right. Moreover, the storybook version implies that replication of experiments should be the norm; the reward system of science is such that replication is unusual. Even if the storybook version does not accurately describe the actual practices of scientists, it is commonly used in presenting the results of research and in justifying conclusions. Another problem with the storybook scientific method is that it misses the cut and thrust of scientific dialogue, in which scientists typically use all the rhetorical skills at their disposal to convince others of the truth of their conclusions. A somewhat more detailed and pragmatic approach to describing the production of scientific knowledge is given in Fig 2-2, and aspects of the process are described next.

Fig 2-2 Typical products and processes in the production of scientific knowledge. [Figure not reproduced; its elements include the true state of nature; selection; the observational window (possible distortions); reliable observations; research design; new and existing data; valid logic; balanced judgment; a new hypothesis/conclusion; rhetoric; submission for publication; assessment and referee's comments; resubmission; acceptance by the editor (publication); and acceptance by peers (evidenced by citations), with objectivity and organized skepticism acting on the process throughout.]

Observation

An important limitation

Science is concerned with what can be observed, and observations are impressions made via the senses. The limitation of direct observation precludes certain important philosophic questions, such as the existence of God, from being answered by science. Medawar37 notes that "it is not to science, but to metaphysics, imaginative literature, or religion that we must turn for answers to questions having to do with first and last things." Figure 2-2 posits that there is, in fact, a "true state of nature" that exists independent of any observers. In doing so, it ignores an ancient debate in philosophy between two opposing schools of thought: realism and idealism. Realists argue that the aim of science is to provide a true description of the world. Idealists hold that the physical world is in some way dependent on the conscious activity of humans.38 An idealist might argue that an apple is not necessarily red in the dark (because no one can see it), or that a tree that falls in an isolated forest makes no sound (because there is no one to hear it). This kind of metaphysical debate involving observable/unobservable distinctions, if it is relevant to any science at all, generally does not concern biomedical scientists.

A fundamental principle: Independence of the observer from the observation

Ziman39 states that the fundamental principle of scientific observation is that all human beings are interchangeable as observers and that scientific information is restricted to those observations on which independent observers can agree. The assumption that all observers are equivalent is not merely a basic principle of Einstein's theory of special relativity; it is the foundation stone of all science. But even this general rule seems to have an exception. Qualitative field research is the disciplined inquiry examining the personal meaning of individuals experiencing and acting in their social environments.40 The researcher is more a part of the phenomenon being investigated than a detached observer, and thus different observers might draw different conclusions. But even in qualitative field research, to help ensure the reliability of the findings, scientists have developed strategies to minimize observer effects, such as (1) checking with the subjects whether the observations are credible, (2) using prolonged times of observation, and (3) using triangulation, in which data and theoretical interpretations are cross-checked between observers and studies.

Selection of observations

What is chosen to be observed depends on the immediate purpose of the investigation. In the December 2002 issue of the Canadian Association of University Teachers' Bulletin, Nobel laureate John Polanyi stated that for scientists, "the decision [of] what to investigate . . . is the most fateful of their lives."41 Choosing what to observe is not just a matter of satisfying an idle curiosity. "For to obtain an answer of note one must ask a question of note, a question that is exquisitely phrased."41

I always ask graduate students who are considering thesis topics, "What is the best possible outcome if everything goes right with the study you are planning? In particular, can the results be published, and where?" Too often students plan studies that are doomed from the start because they ask questions of insufficient importance or negligible novelty. On the other hand, no scientist is admired for failure to solve problems that lie beyond his or her competence. Medawar42 notes that the most such individuals can hope for is the "kindly contempt earned by the Utopian politician. If politics is the art of the possible, research is surely the art of the soluble. Both are immensely practical-minded affairs." In this regard I have noticed that a criterion frequently employed in grant evaluations is feasibility, the concept being that the investigator has the necessary experience, equipment, and environment to perform the proposed studies. Until recently, funding agencies seemed to apply the cardinal rule that the observations must answer specific questions or test specific hypotheses. To label a proposal a "fishing expedition" was the kiss of death. The advent of genomic techniques has brought more acceptance to those proposing fishing expeditions, which are now fashionably called exploratory research. It remains true, in my opinion at least, that asking specific questions is the best strategy for getting publishable answers.

Observations are made in the context of an overall research design. Typically, the design is based on the hypothetico-deductive model, whereby an investigator has a hypothesis (often based on previous studies), such as "substitution of sucrose in the diet by xylitol will reduce caries," and tests it. In designing the test, the researcher will have to determine a feasible investigative approach. If, for example, an experimental strategy is chosen, the investigator has to make decisions on such aspects as the amount of xylitol to be provided, the population to which it is administered, ethics approval, the measurement of sugar consumption, what teeth will be examined, how they will be examined, and when they will be examined. If the investigator decides on a case-control study, different issues arise, such as recruiting xylitol gum users and nonusers who are closely matched on relevant variables. The different designs will yield conclusions of varying certainty, and it may turn out that the most rigorous strategy is not feasible and compromises are necessary.

Issues in clinical studies

In clinical studies, the selection of observations often comes down to asking two principal questions: who and what?

Who

The ideal is that the sample of people studied be representative of the population to whom the conclusions will be applied. Nonrepresentative samples can lead to erroneous conclusions, and editorial guidelines for dealing with the problem have been established.39,43–45 Guidelines include the following:

1. All participants in a study should be accounted for.
2. No more than 15% of the patients should be unavailable for follow-up.
3. The patients should be adequately described, with clear eligibility (ie, inclusion/exclusion) criteria.
4. Details on randomization should be given.

What

Any clinical study must have an outcome variable or end point, that is, a measurement used to judge the effect of the intervention. A potential problem is the choice of the outcome variable. Sometimes investigators employ a surrogate end point (often a risk factor) that is related but not identical to the real variable of interest. For example, if the interest was in the effect of a mouthrinse on dental caries, a researcher might measure the mouthrinse's effects on the accumulation of dental plaque or Streptococcus mutans. Because plaque, particularly the species S mutans, has been implicated in the development of caries, the rationale follows that its reduction would result in the reduction of caries. This surrogate end point would enable an investigator to complete the study more quickly and cheaply than would be possible with a full-scale clinical trial using dental caries as the end point. But it might give misleading results if the mouthrinse removed plaque from surfaces that were not prone to caries but failed to remove it from pits and fissures. There might be an overall reduction of plaque but no change in tooth decay. The use of surrogate variables is particularly widespread in advertisements for dental products, because it is almost always cheaper and quicker to use surrogate end points than to perform a complete clinical study. In medical research, the choice of an inappropriate end point can be a life-or-death matter. In discussing this issue, Sackett46 uses the example of clofibrate, a drug that caused an almost 10% drop in serum cholesterol level (a key coronary risk factor) but increased the death rate by almost 20%.47
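The mouthrinse example reduces to a small worked calculation showing how a surrogate can improve impressively while the end point that matters does not move. All quantities below are invented for illustration, under the simplifying assumption that only pit-and-fissure plaque drives caries:

```python
# How a surrogate end point (total plaque) can mislead about the true
# end point (caries). Plaque amounts are in arbitrary, invented units.
plaque = {"smooth_surfaces": 70.0, "pits_and_fissures": 30.0}
caries_relevant = "pits_and_fissures"   # assumed sole driver of caries

def rinse(p):
    # hypothetical rinse: removes 60% of smooth-surface plaque,
    # none from pits and fissures
    return {"smooth_surfaces": p["smooth_surfaces"] * 0.4,
            "pits_and_fissures": p["pits_and_fissures"]}

after = rinse(plaque)
total_before, total_after = sum(plaque.values()), sum(after.values())
print(f"total plaque reduced by {1 - total_after / total_before:.0%}")
print(f"caries-relevant plaque reduced by "
      f"{1 - after[caries_relevant] / plaque[caries_relevant]:.0%}")
# total plaque reduced by 42%          <- impressive surrogate outcome
# caries-relevant plaque reduced by 0% <- no effect on the end point
```

A 42% reduction in the surrogate coexists here with zero effect on the outcome of interest, which is precisely the pattern the clofibrate example exhibited at the scale of mortality.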

Classification of observations

Classification of observations can be a complex process. The ideal is to group objects by some well-defined character or group of characters into classes that do not overlap. Much of pathology and classical biology is concerned with problems of classification, but classification can become an end in itself and sterile. Carolus Linnaeus devised methods for the classification of biologic materials that are still used today, but he also devised a method for classifying scientists according to military rank that is totally useless. (Needless to say, Linnaeus was a general in his ranking system.48)


Observational reactivity

A problem common to all observations is the probability that the act of observing alters what is being observed, just as television cameras covering a political event alter the actions of the participants.

Qualitative versus quantitative

Scientific observations may be qualitative or quantitative, with the latter being preferred. The chapters on measurement (chapter 12) and errors of measurement (chapter 13) address quantification in detail. However, qualitative observations can be valuable, particularly in sociologic studies. The most frequent techniques of qualitative research are in-depth interviews and participant observation. The information gathered may be used to produce hypotheses on topics on which there may be very little information, to describe social phenomena, and to gain greater insight into the mechanisms through which a known causal process works.49

Hypothesis

Hypotheses are provisional explanations of observations, conjectures on the nature of things that typically lead to predictions on what will be observed under specific conditions. Finding a well-stipulated hypothesis in some papers is difficult, but it is usually found in the introduction. The lack of a hypothesis often means that the author is adopting an unfocused and unproductive approach. In formulating a hypothesis, the scientist considers not only the data from the current study but the existing data in the scientific literature. Useful hypotheses are simple, remain internally consistent, explain all relevant facts, and, most importantly, are testable. The formation of hypotheses requires creative imagination.

In the storybook description of scientific method, the investigator tests a hypothesis by trying to disprove it. Thus, scientists attempt to prove themselves wrong, an approach that differs from the natural tendency of people to prove themselves right. Moreover, at least in the storybook approach, scientists consider plausible alternative explanations for the data and show that these alternative hypotheses are incorrect. Typically, this is done in experimentation by the inclusion of appropriate controls. Only when all other hypotheses are exhausted do investigators reluctantly conclude that their own hypotheses are correct. Once they have done so, they have to convince other scientists.

Publication, repetition, consensus, and recognition

Publication

Academic science has been defined as a social institution devoted to the construction of a rational consensus of opinion over the widest possible field.50 Observations must be published so that they can be considered by the scientific community. A paper’s chances of acceptance for publication are best if the conclusion is well supported. But conclusions are supported not only by observations but also by logical argument and the style in which the information is presented. To obtain acceptance for their paper, authors become advocates for their conclusions rather than adopt the storybook role of the scientist as a detached observer who is reluctantly forced to accept the conclusion.

In evaluating a scientific paper, the reader is forced to evaluate the rhetoric, that is, the persuasive devices used by an author. Rhetoric appears in many places in a scientific paper. In the introduction, an author outlines the importance of the problem; in the materials and methods section, an author might emphasize that the techniques are reliable and well accepted; in the results section, an author might present data to make the effects seen in the study look as significant or as precise as possible; in the discussion, an author might select references that support a particular viewpoint. Unfortunately, scientific articles are written in a style featuring passive sentence construction and impersonal presentation of facts that readers find turgid and difficult to read.51 Moreover, scientific articles may obscure the process by which conclusions were reached, as the intent is to distill the results into a coherent, interpretable story. Louis Pasteur advised students writing scientific papers to “make it look inevitable.”52 The process involved in arriving at the conclusions (the process of discovery) is often ignored; ordinary laboratory events such as mistakes, equipment breakdown, squabbles between workers, and going down blind alleys—details that would make the article longer but would not add anything to scientific knowledge—are not mentioned. Thus, publications emphasize the process of demonstration and the documentation of the findings.

A traditional problem in rhetoric concerns the order in which information is presented. As the process of demonstration requires a coherent story, the experiments may be presented in a different order in the paper than the sequence in which they were done in the laboratory. Likewise, the reasons given for doing an experiment in the paper may differ from the ones used in planning the study. Nobel laureate Peter Medawar53,54 has argued that “the scientific paper is a fraud in the sense that it does give a totally misleading narrative of the process of thought that goes into the making of scientific discoveries.”

Repetition, acceptance, and consensus

The agreement among scientists on the correctness of observations is obtained by repeating the observations. The ability of scientists to repeat an observation indicates a control over the conditions of observation, a situation often attained when the most relevant variables have been identified. A problem with the rapid growth of science is that many observations have not been checked, so researchers are unsure of the degree to which the data are trustworthy. Nevertheless, whether there is agreement can be determined only by publishing the results and observing other scientists’ reactions to them. The response of other scientists to published work has been described as “organized skepticism.” Under ordinary circumstances, the proponents—that is, the authors of the paper—accept the burden of proof. In theory, scientists remain skeptical unless they can reproduce the results. Contrary to the storybook version of science, the scientific community does not normally go out of its way to refute incorrect results. It seems to take less time and energy simply to bypass erroneous material and allow it to fade into obscurity.55

Perhaps the most generally accepted metric of agreement is the number of positive citations a paper receives from other authors. Citation analysis provides readers access to the views of almost the entire scientific community on a paper.56 Thus, it captures what has been called the wisdom of crowds—the phenomenon in which the average of a collection of diverse independent assessors yields a more accurate assessment than any individual expert’s opinion.57

An important channel of communication that indicates acceptance, or lack thereof, is the letter to the editor. In theory, letters to the editor are an important means of ensuring the accountability of authors and editors; they might be considered a post-publication review process. A study of the response of the authors of three randomized controlled trials to letters to the editor of The Lancet indicated that more than half the criticisms made went unanswered by the authors. Moreover, important weaknesses in these trials were ignored in subsequently published practice guidelines.58 Acceptance, it seems, can occur even in the presence of known errors.
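The wisdom-of-crowds claim is, at bottom, a statistical one: the average of many independent, unbiased but noisy judgments has an error roughly 1/sqrt(n) times that of a single judge. The following short simulation (Python; purely illustrative and not drawn from the cited sources) demonstrates the effect.

# Minimal simulation of the "wisdom of crowds": the average of many
# independent, unbiased but noisy assessments is closer to the truth,
# on average, than any single assessor's judgment.
import random

random.seed(1)
TRUE_VALUE = 50.0    # the quantity being judged (arbitrary units)
NOISE_SD = 10.0      # each assessor errs independently with this spread
N_ASSESSORS = 100
N_TRIALS = 10_000

single_err = 0.0
crowd_err = 0.0
for _ in range(N_TRIALS):
    judgments = [random.gauss(TRUE_VALUE, NOISE_SD) for _ in range(N_ASSESSORS)]
    single_err += abs(judgments[0] - TRUE_VALUE)       # one expert alone
    crowd_avg = sum(judgments) / N_ASSESSORS           # the "crowd"
    crowd_err += abs(crowd_avg - TRUE_VALUE)

print(f"Mean error of a single assessor: {single_err / N_TRIALS:.2f}")
print(f"Mean error of the crowd average: {crowd_err / N_TRIALS:.2f}")
# The crowd's error is roughly NOISE_SD / sqrt(N_ASSESSORS), about one-tenth
# of the single assessor's, provided the errors really are independent.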

In general, the hypotheses that become accepted are those based not only on repeated studies but also on different kinds of studies that produce a converging line of evidence that leads to acceptance. Conversely, a single study that disagrees with a hypothesis may not be given much consideration; it is more likely that something is wrong in an isolated adverse study than in an army of confirming studies using diverse techniques. Unfortunately, accepting other scientists’ work as correct can be difficult for scientists who are emotionally disposed to believe that others’ contributions cannot be important. The geneticist Haldane59 outlined four stages on the road to acceptance: (1) this is worthless nonsense; (2) this is an interesting, but perverse, point of view; (3) this is true but quite unimportant; (4) I always said so.

The observations and theories that remain are normally viewed as being correct and cumulative, each study adding—to use the cliché—another brick in humankind’s great wall of knowledge. In this view, science can be distinguished from the humanities, because its products are cumulative. For example, a recent graduate with a BSc in genetics knows incomparably more genetics than Mendel, who founded a quantitative approach to the topic. But few would claim that any modern playwright is the equal of Shakespeare. Science, in this view, gets progressively closer to objective truth.

However, the view that science marches on guided by the logic of falsification and corroboration has been challenged in recent years, preeminently by Thomas Kuhn21 in his book The Structure of Scientific Revolutions. In examining the history of science, Kuhn found that new theories are often inferior to the ones they replace in their ability to explain a wide range of phenomena. Thus, scientific theories are not necessarily a closer approximation to the truth, and scientific growth is not continuous. Another problem for the storybook scientific method is that scientists are reluctant to reject a theory even when there is evidence that it cannot explain or when it appears to be wrong. One explanation for this reluctance is that a theory is not rejected when negative empirical evidence is discovered unless there is a better theory to take its place. Although Kuhn’s views have received some degree of acceptance, working scientists—whom Kuhn regards as mere puzzle solvers—largely ignore his interpretation and continue to believe that as they attempt to falsify or corroborate theories, those unsupported by empirical data will be rejected. Cole60 speculates that it may even be necessary for scientists to believe in the traditional view to proceed with the work of science.


Individually, scientists feel compelled to publish, because rewards in science are allotted to those who first arrive at a particular result. The preoccupation of some scientists with establishing priority is not new and has involved many great scientists, including Newton, Hooke, Huygens, Cavendish, Watt, Lavoisier, Jenner, Furlong, Laplace, and Gauss.61 To establish his priority of observing the rings of Saturn for the first time, Galileo devised a scientific anagram that he included in a letter to Kepler: “Smaismrmilmepoetaleumibunenustlaviras,” which could be decoded into “altissimum planetam tergeminum observavi” (I have observed the uppermost planet to be triple).62 Disputes over priority appear to be decreasing but, as witnessed by the Gallo-Montagnier dispute, are by no means dead. The decrease may be due in part to the standard methods of publication, which record the date a paper is submitted. In some instances, such as the publication of the sequence of the human genome, scientists preempt priority disputes by coming to an agreement whereby the publications appear simultaneously.

After a scientist publishes a paper, the scientist awaits the collective judgment. Sometimes it never comes; a large proportion of papers are never cited. A few papers become classics, in that they are cited extensively over a long time or included in textbooks, which are repositories of accepted knowledge (and which are typically at least 2 years out of date on publication).62 But even textbooks contain errors; journals such as Trends in Biochemical Sciences and Biochemical Education have sections on “textbook errors,” and readers must remain alert even when reading textbooks like this one.

Reproducibility and Openness

While deliberate fraud perpetrated by scientists is a scandalous example of misconduct, a more pernicious problem now attracting attention appears to be the widespread failure of scientists to produce results that can be reproduced by others. As noted in the section on scientific method, repetition is one of the cornerstones of scientific method, and its omission can cast the work of whole fields of research into doubt. It has long been recognized that uncertainties and errors can accompany even good experimental technique practiced by highly competent research workers. Thus, the general practice in science is for investigators to report estimates of error in their publications. But the definitive test of the accuracy of claims is the independent replication of experiments and verification of observations. Such repetition is necessary if the scientific community is to legitimately claim it provides reliable empirical knowledge of the external world.

But obtaining reproducibility entails adopting a norm of openness. That is, if a paper does not specify exactly how observations or experiments were carried out, it would be difficult for other investigators to duplicate them. Standards for openness have been proposed, for example by the Foundation for Peer to Peer Alternatives63 and the Center for Open Science.64 Some journals, such as Science, are revising their standards so that greater transparency, such as accessibility of original data, is ensured.65

The credibility of scientific knowledge is guaranteed not just by sophisticated and accurate techniques but also by the structures of social relationships that link scientific observers and, ideally, benevolently rule their good-faith interactions (discussed in detail by Ziman66,67). The social relationships and benevolent interactions that are necessary for consensus to be established, long thought to be a hallmark of scientists’ behavior, are now being tested by evidence of a lack of reproducibility across a wide range of disciplines. Moreover, these social interactions are now influenced by Darwinian struggles for research funds, patents, and commercialization opportunities that can distort proper scientific behavior.

Benevolent and malevolent behaviors

There have long been concerns about “old boys’ clubs” and the like, where it is perceived that an “in group” seeks to control funding and access to publication to further its own ends. But obtaining evidence for such malevolent behavior is difficult because decisions are often made confidentially, and the processes of approval or denial are not public. For example, grant applicants typically do not know exactly who said what about their submissions, and indeed they can spend considerable effort trying to figure it out so they can target their applications more effectively.

A recent scandal of this type was the “Climategate” affair. A server at the Climatic Research Unit at the University of East Anglia was hacked in 2009, and thousands of emails and computer files were made public. This information lifted the veil of secrecy on some of the politics, strategies, and tactics of scientists involved in climate research. If Otto von Bismarck were alive today, he might modify his comment “Laws are like sausages—it’s best not to see them being made” to include climate science with laws and sausages. Some of the released information was interpreted by climate change skeptics to indicate that there was a conspiracy to suppress critics of climate change and manipulate data. A number of scientific organizations investigated the matter but found no outright evidence of fraud or scientific misconduct. There were, however, obvious breaches of what most would consider the norms for interactions among scientists, particularly the norm of openness. Some reports urged the Climatic Research Unit to open access to the supporting data and methodology and promptly comply with freedom of information requests. The head of the unit, Professor Jones, perhaps described his behavior most accurately when he stated, as quoted in the popular press: “I have obviously written some pretty awful emails.”68 In one such email, he mentions a “trick” to massage figures and hide a decline in historical temperatures. It will be recalled that one characteristic of a reliable authority is having no axe to grind; clearly there was quite a lot of grinding going on here, which made many wonder about the authority of the experts involved.

Reproducibility crisis

The term reproducibility crisis has been coined to indicate an urgent problem in current science. Articles documenting the issue include “The truth wears off” in The New Yorker,69 “Lies, damned lies, and bogus statistics” in The Atlantic,70 “How science goes wrong” in The Economist,71 and “Believe it or not: How much can we rely on published data on potential drug targets” in Nature Reviews Drug Discovery.72 Moreover, institutions are scrambling to identify causes and possible solutions to the irreproducibility problem, as evidenced by articles such as “Estimating the reproducibility of psychological science” in Science,73 “NIH mulls rules for validating key results” and “Cancer trial errors revealed” in Nature,74,75 and “Rescuing US biomedical research from its systemic flaws” in the Proceedings of the National Academy of Sciences of the United States of America.76 These articles indicate that both the public and professional scientists are worried about the degree to which society can trust scientific findings.

Seriousness of the problem

Lack of reproducibility is a problem over a wide range of investigational approaches and topics, from psychology (39 of 100 studies could be reproduced73) to cancer biology. In an in-house test of reproducibility under ideal conditions (ie, same laboratory, same people, same tools, and same assays) to confirm published data on drug-target validation, only 20% to 25% of the published findings were completely in line with the in-house findings.72 Publication in a prestigious journal did not ensure reproducibility: The reproducibility of published data did not significantly correlate with journal impact factors. The traditional safeguards against the publication and propagation of error, namely replication and the rigorous peer review of high-impact journals, are apparently not effective.

Poor reliability in studies would be expected to lead to a loss of faith in the published literature. Early-stage venture capital firms, for example, have an unspoken rule that “at least 50% of published studies, even those in top-tier academic journals can’t be repeated with the same conclusions by an industrial lab.”77 The practical seriousness of what has been called the irreproducibility epidemic or reproducibility crisis can be inferred from the fact that a drug company is establishing an online journal for studies that attempt to replicate experimental results.78 The Reproducibility Initiative funded by the Arnold Foundation79,80 was launched to independently and systematically validate fifty of the highest-impact studies in cancer biology. Funding agencies are also concerned; the NIH is changing procedures for evaluating grants to ensure that technical details and exact experimental design elements are reported.81,82 The NIH is also encouraging the research community to take the steps necessary to reset the self-corrective process of scientific investigation.82
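To get a rough sense of the uncertainty in such figures, one can attach a normal-approximation 95% confidence interval to the psychology result quoted above (39 of 100 studies reproduced); the Python below is illustrative arithmetic only, of the kind developed in the statistics chapters of this book.

# Rough 95% confidence interval for a replication rate of 39/100,
# using the normal approximation to the binomial (illustrative only).
import math

successes, n = 39, 100
p_hat = successes / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"Replication rate: {p_hat:.2f} (95% CI {low:.2f} to {high:.2f})")
# Output: Replication rate: 0.39 (95% CI 0.29 to 0.49)
# Even the optimistic end of the interval falls below one-half.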

The multifactorial reasons for the reproducibility crisis

Little incentive to reproduce findings and perverse incentives to publish quickly

Formal tests of the reproducibility of others’ findings are rare and generally unprofitable for investigators to undertake. First, to formally reproduce findings requires that all the elements of the experiment be repeated in the exact same manner as the original investigation. That requirement can be expensive and time consuming for investigators because laboratories vary in the details of their investigative approach. Moreover, the rewards of science, such as recognition, go overwhelmingly to those first reporting a result and not to those who confirm findings.83 Second, negative results, such as a failure to repeat findings, are difficult to publish because there are generally multiple possible reasons why an experiment failed. Therefore, as a general rule, negative citations are rare in the literature. As there is little incentive for repeating the work of others and considerable time and expense in doing so, formal studies of the reproducibility of results are fewer than would be expected in the traditional description of scientific method.

I think investigators do test the ideas of others informally in attempts to utilize techniques or concepts that might be profitable in their own research programs. If successfully employed, the paper underlying the concept or technique might be cited. But if the methods or ideas do not prove useful, there will be no publication and no citations. The number of such informal failed attempts at replication is thus unknown but likely to be considerable. Most scientists will have tried out some ideas that did not work as advertised in their hands.

In contrast to the lack of incentive to reproduce results, there are “perverse incentives” to publish quickly, preferably in high-impact journals. The pharmaceutical industry apparently has some influence on investigators, but university departments themselves have been known to offer incentives including promotion, tenure, and cash rewards for high-impact productivity. Institutions can also pressure individuals to produce patents, and prospects for employment or continued employment may well depend on productivity that is demonstrated quickly.84 Given such pressures, it is understandable that some vulnerable investigators may adopt a “publish uncertain work and be damned” strategy, particularly when the probability of detection, and thus damnation, is low.

Defective reporting of methodology

If research reports are not unambiguously clear on the methods and materials used, it is difficult to imagine researchers being able to guess what is necessary to make an experiment work, and lack of reproducibility will ensue. Accordingly, the NIH has stressed the need for rigor and reproducibility in biomedical research and for improving the reporting of animal experiments.85 Various measures were recommended, including the estimation of sample size, the details of the randomization procedure, and the blinding of investigators to treatments. These recommendations do not constitute any new insight. Indeed, a paradox has arisen that lack of reproducibility is less of a problem in human clinical trials than in preclinical work, where in theory the investigator has more control over the conditions of the experiment and the experimental units, because clinical trials are governed by regulations that stipulate rigorous design and independent oversight.

It appears that journals and investigators have not given the materials and methods section the attention and respect that should be its due given its crucial importance in enabling reproducibility; this section often seems to be cut and pasted from previous papers. As materials and methods do change with time, the cut-and-paste practice can lead to a phenomenon I call methodological drift, whereby the methods as presented are not the methods actually practiced currently but rather reflect past practices. For example, suppliers and batch numbers may change over time. Recently, in my own lab, we experienced problems in replicating results, and the culprit turned out to be a supplier of a polymer system that polymerized incompletely under the conditions we had previously employed. The result was that toxic molecules leached out into the cell culture media. On tracing back into the supply chain, we found that the problems started when a new batch of the polymer system was introduced.

Another problem is that the reagents relied upon in modern biomedical experimentation have become sophisticated to the point that individual investigators have difficulty assessing their quality and often assume the materials are what the manufacturers represent them to be. Specific antibodies, for example, are central to many assays, and Science Exchange, a company that matches scientists with verification service providers, has launched a program to independently validate research antibodies.74

Elements sometimes overlooked in reporting methodology include specification of the experimental design itself, including the procedures for randomization, sample size calculations, and the effect of sex differences. Indeed, the grant committees of the Canadian Institutes of Health Research are now required to explicitly consider sex and gender differences where relevant, and committee members have to complete a small web-based course to familiarize themselves with the issues. Some journals are working to improve reporting practices; Nature, for example, has updated its guidelines, introduced checklists,75 and relaxed the limitations on length for reporting methodology.

Institutional efforts to deal with the reproducibility problem have been deemed insufficient. Begley et al84 state that “conspicuous by their absence from these efforts [in improving reproducibility] are the places in which science is done: universities, hospitals, government-supported labs, and independent research institutes. This has to change. Institutions must support and reward researchers who do solid—not just flashy—science and hold to account those whose methods are questionable.” In this view, science needs fewer “Flash Stans” and more “Steady Eddies.”

Hypercompetition

Alberts et al76 have argued that US biomedical research suffers from systemic flaws, particularly an unsustainable hypercompetitive system. The percentage of recent PhDs in academic positions has fallen to around 20%, and the percentage of successful applications in NIH grant competitions is in the low teens. Reproducing findings necessarily slows down publication and under the current conditions could be detrimental to a career. A survey of US biomedical trainees conducted at the prestigious MD Anderson Cancer Center found that nearly 1 in 5 felt pressure to publish uncertain findings, over 3 in 10 felt pressure to support a mentor’s hypothesis even when the data did not support it, and nearly one-half knew of mentors who required lab members to have a high-impact publication before moving on.84 Such practices would tend to propagate when these young scientists themselves become supervisors, for “as the twig is bent so grows the tree.” In addition, pressure is heaped on the many investigators who draw salary support from grants and need grants to be renewed to retain employment. Alberts et al76 and Begley et al84 both make a number of recommendations on improving institutional and granting practices, but the feasibility of such approaches has not been established.

Plans to enhance reproducibility

Given that self-correction is one of the foundations of scientific integrity but appears to be failing, a number of approaches have been proposed as remedies by NIH leadership and others. Suggestions include modifying peer review, requiring independent labs to validate the results of important preclinical studies, contracting out validation, developing training programs that include a module on enhancing reproducibility, implementing good experimental design, and emphasizing transparency of research findings.82 The feasibility of contracting out validation studies has been questioned on grounds of expense as well as the possibility of meticulous researchers being tarred with the brush of irreproducibility when the faults lie not with them but with the inadequacy of independent validators who lack the experience or materials to replicate results in complex systems.85

Tracing of personal accountability

In the current scientific world, where many publications have multiple authors, it is difficult to determine the contributions of specific individuals to a paper and where accountability should reside for specific findings or data. This problem became evident to me when I was serving as a consultant on the selection of a professor for a European institution in which several of the applicants had collaborated with each other and also with other scientists from the same institution. It was clear that some excellent work had been done, but it was not clear how the credit for the work should be distributed. On the other side of this recognition coin, if data proved to be unreliable, it would be difficult to determine who was to blame. The NIH is contemplating modifying the format of its “biographical sketch” to delineate the part played by the applicant in multiauthored projects. Collins and Tabak86 state that the NIH is firmly committed to making systematic changes that should reduce the frequency and severity of the reproducibility problem, but caution that “success will come only with the full engagement of the entire biomedical-research enterprise.”

Some comments on the reproducibility problem

Scientific misconduct and the lack of reproducibility of some findings pose definite problems for scientific progress. In particular, science becomes less efficient if research funding is not put to the good use of producing reliable information, and the application of dubious information crowds out better alternatives. In general, however, the scientific method and the safeguard of repetition eventually lead to bad information being weeded out from fruitful science. Indeed, our day-to-day experience confirms the efficacy of science in building new knowledge that improves the human condition. Dentistry, for example, steadily incorporates new technology, improved procedures, and advanced biomedical information in diagnosis and treatment. Recently, for instance, I underwent endodontic treatment because computed tomography identified a problem that could not be discerned using standard bitewing radiographs.

The problem for individual scientists and clinicians is to distinguish the wheat from the chaff. Critical thinking and close evaluation of scientific papers before accepting the purported findings is the first step. As noted, some fraudulent published work could be identified as such by any careful reader, because reported numeric results did not add up or were inconsistent. Similarly, close attention to methodology would lead to some published work being dismissed as subject to alternative interpretations, and to its application to patients being judged premature. The existence of the reproducibility crisis, which involves even high-impact journals with rigorous reviewing, reinforces the principle that repetition is a cornerstone of the scientific method. Repetition takes time, and haste to apply an exciting result in research or clinical practice may well lead to waste. Thus, scientists and clinicians need to keep up-to-date on the literature to determine how well concepts and findings are being reproduced and accepted by publishing scientists and clinicians.

Positivism Versus Conventionalism

The approach to scientific method outlined in the previous pages is based on a positivist view of the world. In this view, there are laws of nature that are empirical statements describing the real world. Statements made about this world can be either true or false as a matter of fact. Thus, a positivist would hold that there really are such things as atoms or viruses or complex peptides such as thyrotropin-releasing hormone (TRH). It is this view that predominates in biomedical, biologic, and most natural sciences. On occasion, however, this model breaks down. In vacuum-tube electronics, electrons are treated as particles, whereas in good crystalline solids, they must be treated as waves.87 Positivists might have trouble answering the direct question: Is an electron a wave or a particle? The question can be answered by describing the electron in the terms of quantum mechanics, but that description is a long way from our ordinary conception of reality, which includes objects like chairs or doorknobs that can be described in terms of direct sense impressions.

Conventionalists argue that it does not matter whether the laws of nature are true or false but rather under what conditions they provide the most economical, fruitful, and illuminating picture of reality. In this view, a successful experiment shows that a certain way of describing the world is useful. Conventionalism assigns to scientific knowledge no higher status than that of being a useful hypothesis.88 When a conventionalist anthropologist visited the laboratory of Nobel laureate Guillemin, who was studying TRH, he portrayed the laboratory’s activities as “organization of persuasion through literary inscription.” He would describe TRH as a fact that was constructed on the basis of inscriptions provided by various instruments, the interpretive and persuasive powers of the scientists in the laboratory, and the acceptance of those interpretations in the larger community of endocrinologists.88 By contrast, a positivist biochemist would think of the laboratory’s work as isolating and sequencing a particular molecule.

Opportunities for conventionalism in dental research: The case for qualitative studies

The vast majority of biochemists would be positivists, but in the social sciences conventionalists would perhaps form the majority. The social sciences deal with constructs—unobservable, constructed variables used to label patterns of observable variables. Examples of constructs include dental aptitude, socioeconomic status, and intelligence. Statistical procedures, such as factor analysis and canonical correlations, have been developed to locate constructs. In an editorial in the Journal of Dental Research, Chambers89 argued that dental research should shed some of its positivist outlook centered on life sciences and embrace the methods of the social sciences. Educational research, for example, has experienced a blossoming in qualitative studies. Despite the methodologic advances, qualitative field research does not figure as largely as it might in dental research. It seems that various aspects of dentistry could best be investigated by this approach, in particular, providing evidence and theories that enable health professionals to understand their clients better as people.

References

1. Medawar PB. The Limits of Science. Oxford: Oxford University, 1984:16.
2. Bronowski J. The Common Sense of Science. New York: Vintage, 1951:97–118.
3. Kuhn TS. Normal science as puzzle-solving. In: Kuhn TS (ed). The Structure of Scientific Revolutions, ed 2. Chicago: University of Chicago, 1970:35–42.
4. Grinnell F. Discovery: Learning new things about the world. In: Grinnell F (ed). Discovery in Everyday Practice of Science: Where Intuition and Passion Meet Everyday Logic. Oxford: Oxford University, 2009:21–58.
5. Merton RK. Science and democratic social structure. In: Merton RK (ed). Social Theory and Social Structure. London: Collier Macmillan, 1968:604–615.
6. Ziman J. An Introduction to Science Studies: The Philosophical and Social Aspects of Science and Technology. Cambridge: Cambridge University, 1984:84.
7. Judson HF. The Baltimore affair. In: Judson HF (ed). The Great Betrayal: Fraud in Science. Orlando: Harcourt, 2004:242.
8. Merton RK. Reference groups, invisible colleges, and deviant behaviour in science. In: O’Gorman HJ (ed). Surveying Social Life: Papers in Honor of Herbert H. Hyman. Middletown, CT: Wesleyan University, 1988:174.
9. Seventy-five reasons to become a scientist. Am Sci 1988;76:450–463.
10. Latour B, Woolgar S. Laboratory Life: The Social Construction of Scientific Facts. Beverly Hills: Sage, 1979.
11. Cronin B. The Citation Process: The Role and Significance of Citations in Scientific Communication. London: Taylor Graham, 1984:20.
12. Simonton DK. Creativity in Science: Chance, Logic, Genius, and Zeitgeist. Cambridge: Cambridge University, 2004:172–173.
13. Surowiecki J. The Wisdom of Crowds: Why the Many are Smarter than the Few and How Collective Wisdom Shapes Business, Economies, Societies, and Nations, ed 1. New York: Doubleday, 2004:162–163.
14. Axelrod R. The Evolution of Cooperation. New York: Basic Books, 1984:174.
15. Hearnshaw LS. Cyril Burt, Psychologist. New York: Vintage Books, 1981:70.
16. Hearnshaw LS. Cyril Burt, Psychologist. New York: Vintage Books, 1981:288.
17. Wainer H. Graphic Discovery: A Trout in the Milk and Other Visual Adventures. Princeton: Princeton University, 2005:6.
18. Wade N. Scandal in the heavens: Renowned astronomer accused of fraud. Science 1977;198:707.
19. Brush SG. Should the history of science be rated X? The way scientists behave (according to historians) might not be a good model for students. Science 1974;183:1164.
20. Westfall RS. Newton and the fudge factor. Science 1973;179:751.
21. Kuhn TS. The resolution of revolutions. In: Kuhn TS (ed). The Structure of Scientific Revolutions, ed 2. Chicago: University of Chicago, 1970:144–159.
22. Bondi H. Fact and inference in theory and in observation. Vistas Astronomy 1955;1:155.
23. Judson HF. What’s it like? A typology of scientific fraud. In: Judson HF (ed). The Great Betrayal: Fraud in Science. Orlando: Harcourt, 2004:44–48.
24. Friedman DH. Wrong. Why Experts* Keep Failing Us—And How to Know When Not to Trust Them. New York: Little, Brown and Company, 2010:255–257.
25. Jon Sudbø. Wikipedia. https://en.wikipedia.org/wiki/Jon_Sudbø. Accessed 15 April 2019.
26. BBC News. Cancer study patients “made up.” http://news.bbc.co.uk/2/hi/health/4617372.stm. Accessed 15 April 2019.
27. Broad W, Wade N. Betrayers of the Truth: Fraud and Deceit in the Halls of Science. New York: Simon and Schuster, 1982.
28. Kohn A. False Prophets: Fraud and Error in Science and Medicine. Oxford: Blackwell, 1986.
29. Mosteller F. Evaluation: Requirements for scientific proof. In: Warren KS (ed). Coping with the Biomedical Literature. New York: Praeger, 1981:103–121.
30. Gordon M. Evaluating the evaluators. New Sci 1977;73:342.

31. Martinson BC, Anderson MS, de Vries R. Scientists behaving badly. Nature 2005;435:737–738.
32. Doty P. Cited by: Judson HF (ed). The Great Betrayal: Fraud in Science. Orlando: Harcourt, 2004:224.
33. Rousseau DL. Case studies in pathological science. Am Sci 1992;80:54–63.
34. Davenas E, Beauvais F, Amara J, et al. Human basophil degranulation triggered by very dilute antiserum against IgE. Nature 1988;333:816–818.
35. Hetherington NS. Just how objective is science? Nature 1983;306:727.
36. Rustia M, Shubik P. Experimental induction of hepatomas, mammary tumors, and other tumors with metronidazole in noninbred Sas:MRC(WI)BR rats. J Natl Cancer Inst 1979;63:863.
37. Medawar PB. The Limits of Science. Oxford: Oxford University, 1984:60.
38. Okasha S. Philosophy of Science: A Very Short Introduction. Oxford: Oxford University, 2002:58.
39. Ziman JM. Reliable Knowledge: An Exploration of the Grounds for Belief in Science. Cambridge: Cambridge University, 1978:42.
40. Polgar S, Thomas SA. Introduction to Research in the Health Sciences, ed 2. Melbourne: Churchill Livingstone, 1991:97–98.
41. Polanyi JC. Free discovery from outside ties. Can Assoc Univ Teachers Bull 2002;49(10):A3.
42. Medawar PB. The Art of the Soluble: Creativity and Originality in Science. Harmondsworth: Penguin, 1969:97.
43. Simon R, Wittes RE. Methodologic guidelines for reports of clinical trials. Cancer Treat Rep 1985;69:1.
44. Bailar JC 3rd, Mosteller F. Guidelines for statistical reporting in articles for medical journals. Amplifications and explanations. Ann Intern Med 1988;108:266.
45. Chilton NW, Barbano JP. Guidelines for reporting clinical trials. J Periodontal Res Suppl 1974;14:207.
46. Sackett DL. Evaluation: Requirements for clinical application. In: Warren KS (ed). Coping with the Biomedical Literature. New York: Praeger, 1981:123.
47. A co-operative trial in the primary prevention of ischaemic heart disease using clofibrate. Report from the Committee of Principal Investigators. Br Heart J 1978;40:1069.
48. Mason SF. A History of the Sciences, ed 2. New York: Collier Macmillan, 1962:334.
49. Cole S. The Sociological Method: An Introduction to the Science of Sociology, ed 3. Boston: Houghton Mifflin, 1980:121.
50. Ziman J. An Introduction to Science Studies: The Philosophical and Social Aspects of Science and Technology. Cambridge: Cambridge University, 1984:10.
51. Gushee DE. Reading behavior of chemists. J Chem Doc 1968;8:191.
52. Holton G. Quanta, relativity, and rhetoric. In: Pera M, Shea WS (eds). Persuading Science: The Art of Scientific Rhetoric. Sagamore Beach, MA: Science History, 1991:174.
53. Medawar P. Is the scientific paper a fraud? In: The Strange Case of the Spotted Mice. Oxford: Oxford University, 1996:33–39.
54. Medawar P. Is the scientific paper a fraud? The Listener. BBC Third Programme. 12 Sept 1963.
55. Meadows AJ. Communication in Science. London: Butterworths, 1974:45.
56. Cronin B. The Citation Process: The Role and Significance of Citations in Scientific Communication. London: Taylor Graham, 1984:79.
57. Surowiecki J. The Wisdom of Crowds. New York: Doubleday, 2004.


58. Horton R. Postpublication criticism and the shaping of clinical knowledge. JAMA 2002;287:2843.
59. Haldane JBS. The truth about death. J Genet 1963;58:464.
60. Cole S. The Sociological Method: An Introduction to the Science of Sociology, ed 3. Boston: Houghton Mifflin, 1980:129–130.
61. Merton RK. Priorities in scientific discovery: A chapter in the sociology of science. Am Sociol Rev 1957;22:635.
62. Meadows AJ. Communication in Science. London: Butterworths, 1974:57.
63. P2P Foundation Wiki. Openness in Science. https://wiki.p2pfoundation.net/Openness_in_Science. Accessed 3 June 2019.
64. Center for Open Science. Transparency and Openness Promotion Guidelines. https://cos.io/top. Accessed 3 June 2019.
65. McNutt M. Reproducibility. Science 2014;343:229.
66. Ziman J. An Introduction to Science Studies: The Philosophical and Social Aspects of Science and Technology. Cambridge: Cambridge University, 1984.
67. Ziman J. Reliable Knowledge: An Exploration of the Grounds for Belief in Science. Cambridge: Cambridge University, 1991.
68. British scientist in climate row admits “awful” emails. Phys.org. Published 10 March 2010. https://phys.org/news/2010-03-british-scientist-climate-row-awful.html. Accessed 3 June 2019.
69. Lehrer J. The truth wears off. The New Yorker. Published 5 December 2010. https://www.newyorker.com/magazine/2010/12/13/the-truth-wears-off. Accessed 3 June 2019.
70. Fournier R. Lies, damned lies and bogus statistics. The Atlantic. Published 5 May 2014. https://www.theatlantic.com/politics/archive/2014/05/lies-damned-lies-and-bogus-statistics/460980/. Accessed 3 June 2019.
71. How science goes wrong. The Economist. Published 21 October 2013. https://www.economist.com/leaders/2013/10/21/how-science-goes-wrong. Accessed 3 June 2019.
72. Prinz F, Schlange T, Asadullah K. Believe it or not: How much can we rely on published data on potential drug targets? Nat Rev Drug Discov 2011;10:712.
73. Open Science Collaboration. Estimating the reproducibility of psychological science. Science 2015;349:aac4716.
74. Wadman M. NIH mulls rules for validating key results. Nature 2013;500:14–16.
75. Samuel Reich E. Cancer trial errors revealed. Nature 2011;469:139–140.

76. Alberts B, Kirschner MW, Tilghman S, Varmus H. Rescuing US biomedical research from its systemic flaws. Proc Natl Acad Sci U S A 2014;111:5773–5777.
77. Osherovich L. Hedging against academic risk. SciBX 2011;4(15). doi:10.1038/scibx.2011.416.
78. Laguipo A. Drug company Amgen launches online journal for studies that attempt to replicate experimental results. Tech Times. Published 5 February 2016. https://www.techtimes.com/articles/130935/20160205/drug-company-amgen-launches-online-journal-for-studies-that-attempt-to-replicate-experimental-results.htm. Accessed 3 June 2019.
79. Arnold Foundation awards $1.3 million to validate cancer studies. Philanthropy News Digest. Published 13 October 2013. https://philanthropynewsdigest.org/news/arnold-foundation-awards-1.3-million-to-validate-cancer-studies. Accessed 3 June 2019.
80. Science Exchange. Reproducibility Project: Cancer Biology. http://validation.scienceexchange.com/#/cancer-biology. Accessed 3 June 2019.
81. Baker M. First results from psychology’s largest reproducibility test. Nature 2015. doi:10.1038/nature.2015.17433.
82. National Institutes of Health. Rigor and Reproducibility. https://www.nih.gov/research-training/rigor-reproducibility. Accessed 3 June 2019.
83. Meadows AJ. Communication in Science. London: Butterworths, 1974:54.
84. Begley CG, Buchan AM, Dirnagl U. Robust research: Institutions must do their part for reproducibility. Nature 2015;525:25–27.
85. National Institutes of Health. Improving the Quality of NINDS-Supported Preclinical and Clinical Research Through Rigorous Study Design and Transparent Reporting. https://www.ninds.nih.gov/sites/default/files/transparency_in_reporting_guidance_1.pdf. Accessed 3 June 2019.
86. Collins FS, Tabak LA. Policy: NIH plans to enhance reproducibility. Nature 2014;505:612–613.
87. Ziman J. An Introduction to Science Studies: The Philosophical and Social Aspects of Science and Technology. Cambridge: Cambridge University, 1984:43–90.
88. Latour B, Woolgar S. Laboratory Life: The Construction of Scientific Facts. Princeton: Princeton University, 1986.
89. Chambers D. The need for more tools. J Dent Res 1991;70:1098–1099.


3 The Components of a Scientific Paper



“. . . the research report represents an extended argument in which researchers seek to convince readers that their research questions are important, their methods were sensibly chosen and carefully carried out, their interpretations of their findings sound, and their work represents a valid contribution to the developing field.”
ANN M. PENROSE AND STEVEN B. KATZ1

Active Reading and the Scientific Paper

I think most professionals would agree that keeping up-to-date on current research is important. However, there are two major constraints. The first is time: reading and understanding the research literature consumes time that can often be more pleasurably used in other pursuits. The second is that reading research can be tough, involving diverse barriers from accessing the relevant literature to understanding concepts rarely used in daily life. The intent of this book is to make reading and understanding research both easier and more efficient by introducing a systematic approach to understanding and evaluating research articles. It advocates the adoption of the principle of active reading, whereby the reader forms expectations of what is provided in each section and determines the adequacy of the information presented. Early detection of a serious problem can eliminate an article from detailed consideration and thus save time.

Often, the primary goal of reading a research article is simply to understand the basic message delivered by the author. This limited approach has the consequence that such a reader is likely to accept the author’s agenda and less likely to consider other alternatives. However, active inquiry and critical thinking help readers to understand the strengths and weaknesses of a study and to assess whether the study design is sufficiently valid that the results should be considered when making treatment decisions. Critical thinking as applied to the scientific literature encompasses a variety of approaches centered on the central problem of assessing the truth and applicability of an article’s conclusions. Active reading of a research article entails knowing the common pitfalls of various types of research approaches and identifying whether they are problematic in the particular article under consideration. For example, much research involves taking samples from populations, and questions arise of how representative the samples are and whether they are large enough to detect differences between various populations or treatments.

Components and Their Functions

Scientific papers are traditionally divided into components, each of which has a different function. Journals vary in the order in which components are arranged, but most journals have a materials and methods section placed before the results. The journal Science, however, prints the experimental details as footnotes at the end of the article. Other journals do not print some detailed information but rather relegate it to online access only. The functions of the components are outlined below.

Title

The purpose of the title is to indicate the subject of the study. Huth2 states that there are two types of titles: an indicative title tells what the paper is about, whereas an informative title states the message of the paper. “The effects of surface topography on cell behavior” would be an indicative title. “Grooved implant surfaces inhibit epithelial downgrowth” would be an informative title. A third class of title might be called the “sales” title, which promises some benefit to the reader. A title like “A rapid method for isolating plasma membranes from mouse L cells using two phase systems” promises a benefit to readers who isolate membranes, and “A reliable method for quantifying band location on human chromosomes” promises a benefit to those mapping chromosomes. In an era where citations are important to careers, it makes sense to entice readers, who are potential citers, to read your paper. Watson and Crick3 published their seminal one-page article with the understated title “Molecular structure of nucleic acids: A structure for deoxyribose nucleic acid”; they obviously didn’t feel any need to promote that research. But earlier in their careers they demonstrated enthusiasm with little supporting data; their constant chatter about helices had provoked the eminent biochemist Erwin Chargaff to comment sardonically in his notebook “two pitchmen in search of a helix.”4

A well-written title is informative, concise, and graceful.5 It attracts the interest of readers while giving them the topic of the study. The current trend is to present the conclusion in the title as, for example, “Dilantin causes gingival overgrowth,” rather than “Studies on the effects of drugs on human gingival tissues. Part XII. Dilantin.” On reading the title, the reader should ask why the study was done in the first place, bearing in mind that often what gets studied depends on what gets funded. Agencies fund what interests them. Commercial interests fund research to provide credibility for their products. In such instances, the research may be providing facts for only one side of the argument.

Authors

Today many studies are collaborative affairs, and one of the first decisions is to decide who among the collaborators should be an author. The International Committee of Medical Journal Editors (ICMJE6) has established guidelines on who qualifies for authorship. Four conditions must be met:

1. Substantial contributions to the conception and design or acquisition of data or analysis and interpretation of data
2. Drafting the article or revising it critically for important intellectual content
3. Final approval of the version to be published
4. Agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved

ICMJE also states, “Examples of activities that alone (without other contributions) do not qualify a contributor for authorship are acquisition of funding; general supervision of a research group or general administrative support; and writing assistance, technical editing, language editing, and proofreading. Those whose contributions do not justify authorship may be acknowledged individually or together as a group under a single heading.”

The above guidelines, in my view, are somewhat fuzzy (for example, what actually does “substantial” mean in operational terms?). Moreover, my observation of how research groups actually work suggests that more flexible criteria operate in practice. The leader of a large research group may have outlined a general approach to a particular topic (substantial contribution?), but it is the troops at the lab benches who actually make the operational decisions that make the project feasible. Some of the authors may have English as a second language and quite frankly would be unable to understand the nuances of the arguments in the paper. Modern research also involves collaboration with specialists, and some authors are not able to evaluate the data coming from their collaborators. Moreover, there is the political problem that if the head of the lab is not placed on papers, his or her productivity may be questioned when renewal of funding is requested.

Another issue that can generate considerable discord among collaborators is the order of the authors. The general rule is that the first author is the person who did the work and prepared the first draft of the paper, and the last author is the senior author who leads the research group and probably obtained the funding. The intervening authors (second, third, fourth, etc) are in order of their contribution to the work. The general practice at present seems to be that authorship is distributed rather generously, sometimes in return for very little contribution. That generosity may become somewhat curtailed now that there are indices that adjust citation counts according to the number of authors on the papers. The universal advice is that such thorny issues as the inclusion and placement of authors be ironed out early in the journey to publication. A useful resource for dealing with this potential problem is an editorial published in Biological Conservation that details how to avoid publishing conflict among authors and a proposed agreement for coauthor teams.7

Scientists, like horses, come in various classes and run in more or less distinguished company. Their publication productivity varies, and they publish in journals of varying impact. The problem is distinguishing the thoroughbreds from the plough horses from the rogues. A technique called citation analysis, using the Web of Science, the Science Citation Index, as well as other programs such as Google Scholar—which may be likened to the Daily Racing Form in this analogy—is one approach to this problem. Previously unpublished authors present a different challenge, as noted by Sackett,8 who recommends that the work of unknown scientists, like the work of unknown sculptors, deserves at least a passing glance.

Right below the list of authors are their institutional affiliations; where the study was done has been found to carry some weight with referees. The author’s address enables readers to write for reprints, clarification, or discussion of the paper. It will sometimes give clues to the assumptions or traditions behind the research. For example, consider a paper on dental implants published by a group in Gothenburg, Sweden, the site where the Brånemark system was developed. The reader can be confident in a firmly embedded assumption that the most desired form of integration of the implant is through a bony interface.

With the growth of multiauthored papers, the question of what authorship signifies has arisen. At one extreme is the view that an author can justify intellectually the entire contents of an article. But that is a difficult standard to uphold in the world of multidisciplinary research, where a clinician might have trouble justifying complex statistical procedures or where a statistician is held responsible for the molecular biologic procedures. Some medical journals require contributors to state explicitly what part they played in the research. Considering the complex negotiations that can attend the production of a multiauthor paper and the changes wrought by referees, one might expect that the eventual published article would only rarely represent the full range of opinions of the coauthors.
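The indices mentioned above that adjust citation counts for the number of authors can work in several ways; the sketch below (Python, with hypothetical paper data) shows two common schemes, fractional counting, which splits credit equally, and harmonic counting, which weights credit by byline position. The exact formulas in use vary by index, so treat this only as an illustration.

papers = [
    # (citations, number of authors, author's position in the byline)
    (40, 2, 1),
    (100, 8, 5),
    (12, 1, 1),
]

def fractional_credit(citations, n_authors, position):
    # Each of the n authors gets an equal 1/n share of the citations.
    return citations / n_authors

def harmonic_credit(citations, n_authors, position):
    # Author k of n gets (1/k) / (1 + 1/2 + ... + 1/n) of the citations,
    # so earlier byline positions receive more credit.
    total_weight = sum(1 / k for k in range(1, n_authors + 1))
    return citations * (1 / position) / total_weight

for scheme in (fractional_credit, harmonic_credit):
    total = sum(scheme(c, n, p) for c, n, p in papers)
    print(f"{scheme.__name__}: {total:.1f} adjusted citations")

Under both schemes, the sole-author paper retains full credit while the heavily multiauthored paper contributes much less, which is precisely why such indices may curtail the generous distribution of authorship.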

Date of submission and acceptance

Ostensibly, these dates are given to establish the authors’ priority of discovery. A long delay between submission and acceptance may indicate that referees found serious problems in the initial version and that extensive rewriting was required, or it may indicate inefficiency on the part of the editors, referees, or authors. Many journals have moved from paper-based to electronically based submission and review of articles, and, as a general rule, this change reduces the time between submission and publication. Some variables are associated with the date and location of the study. For example, the incidence of dental caries in North America will probably be lower in the 21st century than it was in the mid-20th century, and this might have an effect on the results of some types of study.

Abstract or summary

Function

Put simply, the abstract summarizes the purpose of the study, what was done, what was found, and what was concluded. For a journal article, the abstract has to be written as a stand-alone, one-paragraph component. It will be the first-read and most-read part of the article, as readers typically use the abstract to decide whether they will read the whole paper. Some literature retrieval programs (eg, Medline, Web of Science) publish the abstract just as it is in the paper; if the authors don't produce an abstract, these services will not produce one. Abstracts also need to be prepared for scientific meetings (such as those of the International Association for Dental Research), where they will be reviewed for inclusion in the conference. As it is a stand-alone production, the abstract needs to incorporate all the elements of the paper, including the problem being investigated, the objectives of the study, the methods used, a summary of the results, and the conclusions. Being the first element read, the abstract frames reviewers' attitudes to the paper as a whole. It must attract interest and at the same time be written clearly and logically. As is often the case, first impressions are important. It is generally recommended that the abstract be the last section written. W. M. Thomson of the University of Otago, an experienced editor and prolific author in dental epidemiology, recommends a simple method of preparing an abstract in which key sentences from the various sections are cut and pasted into the abstract and then edited (personal communication). For example, the abstract could include one or two sentences for the objectives from the introduction, two sentences emphasizing the study design and measurements from the methods, two to four sentences that typically highlight summaries of the data from the results, and four sentences from the discussion. This approach ensures that the main points will be covered and, as it uses materials that have already been considered carefully, will probably require little editing. That advice to build the abstract prefab style out of previously existing components is not universally endorsed, however, possibly because it can lead to a somewhat choppy style. Although the rule of not including references to the literature in the abstract generally applies in primary journals, on occasion the rule is not enforced for conference abstracts, which might also include tables or even figures. If in doubt, check the abstracts from previous conferences of the organization.
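
As a toy illustration of Thomson's cut-and-paste method, here is a hypothetical Python sketch; the placeholder sentences stand in for key sentences already written for each section, in the proportions suggested above:

    # key sentences lifted from the draft manuscript (placeholders, not real content)
    key_sentences = {
        "introduction": ["Objectives sentence 1.", "Objectives sentence 2."],
        "methods": ["Study design sentence.", "Measurements sentence."],
        "results": ["Main finding.", "Secondary finding."],
        "discussion": ["Conclusion 1.", "Conclusion 2.", "Conclusion 3.", "Conclusion 4."],
    }

    # paste the sentences together in IMRaD order to form the first draft,
    # which is then edited by hand for flow
    draft_abstract = " ".join(
        sentence
        for section in ("introduction", "methods", "results", "discussion")
        for sentence in key_sentences[section]
    )
    print(draft_abstract)

The point of the recipe is not the pasting itself but that every sentence entering the abstract has already survived careful drafting in its home section.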

Structured abstracts

Structured abstracts are abstracts that have distinct labeled sections such as methods, results, and conclusions. Structured abstracts are thought to enable more rapid comprehension, and there is evidence that they aid computer searching of abstracts for Medical Subject Headings (MeSH) terms and thus might make identification of your article by computer searching more effective. Other benefits include guiding authors in summarizing the content of their manuscripts precisely and facilitating the peer review process for manuscripts submitted for publication. More information on structured abstracts can be found at structuredabstracts.nlm.nih.gov and nlm.nih.gov/bsd/policy/structured_abstracts.html.

The form of the structured abstract is tailored to the type of research being reported. Most will start with a short introduction or objectives section written in the present tense, containing the specific field of inquiry, the hypothesis, and the purpose of the study. A list of keywords is found at the end. The intervening sections of the structured abstract can vary, as they generally mimic the sections of the journal article. For example, a case report might have sections on differential diagnosis, treatment, uniqueness, and conclusions, whereas a report on a clinical technique might have sections on background, description, and clinical advantages. Items related to methods and results are written in the past tense, whereas the discussion is written in the present tense. The abstract should tell the reader why the study was done, what was done, what was found, and what was concluded, whereas a summary usually focuses on the principal findings and conclusions of the study.9

Although often overlooked, the abstract should be studied closely by anyone interested in either seriously evaluating a paper or saving time. For example, the abstract should contain the information that enables the reader to answer the Fisher assertibility question (AQ),10 which asks, "What information would I require to accept the conclusions?" If the methods used in the paper do not provide the appropriate information, the reader may wish to skip reading the paper in closer detail, thus saving time. In answering the AQ, one has to make a judgment about appropriate standards—that is, how much evidence is required to prove the point. Too severe a standard will suggest that nothing can be known with certainty and nothing is worth reading; conversely, lax standards will lead to error and gullibility.10 Having read the abstract and determined that the paper merits attention, the reader can consider the contents with the author's conclusion in mind and is thus in a better position to identify weaknesses. In a paper reporting negative results, such as a treatment that did not produce a beneficial effect, a critical question is the size of the sample and the power of the statistical test. On encountering a negative result in the abstract, the reader is forewarned to pay particular attention to the size of the sample. The AQ, as well as analysis of the relationship between evidence and conclusion, falls into the domain of logic, which is covered in chapters 6 to 9. The logic of statistical inference is of sufficient importance that it is featured in three chapters: 10, 11, and 21.

For medical journals, there has been a movement to structured abstracts, which are required by the CONSORT guidelines adopted by medical editors for reports of randomized controlled trials (RCTs). A structured abstract provides the objective, design, setting, subjects, interventions, outcomes, results, and conclusions of a study. In general, structured abstracts provide more—and more easily accessible—information than unstructured abstracts and aid investigators preparing systematic reviews.
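
As a hypothetical illustration (the trial and every number in it are invented for this example), a structured abstract following those headings might be laid out as follows:

    Objective: To determine whether daily rinsing with mouthrinse X reduces gingival bleeding.
    Design: Double-blind randomized controlled trial of 6 months' duration.
    Setting: University dental clinic.
    Subjects: 120 adults with gingivitis randomized 1:1 to test and placebo groups.
    Interventions: Twice-daily rinsing with X or a placebo rinse.
    Outcomes: Percentage of sites bleeding on probing at baseline, 3 months, and 6 months.
    Results: Bleeding sites were reduced by 20% in the test group relative to placebo (P < .01).
    Conclusions: Daily rinsing with X reduced gingival bleeding over 6 months.

The labeled sections let a reader locate the design and the size of the effect at a glance, which is precisely what makes structured abstracts useful for rapid screening and systematic reviews.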

Introduction

Purposes of the introduction

The introduction has several purposes. First, it explains why the study was done; second, it entices the journal's audience to read it by demonstrating that the question of interest is important and that new information on this interesting question will be presented. The question of interest may be stated as a problem or a hypothesis. A third purpose might be described as self-promotion; the authors desire to proclaim that they are competent by virtue of the work they have done previously on this topic or a related one.

Structure of the introduction

First, the context of the study is briefly presented to orient the reader. The current practice is that a comprehensive literature review is not presented in the introduction; rather, the author includes sufficient background information, such as the most prominent work of others on the topic, to allow the reader to understand why the study was done without having to look up previous publications. Because research in any research group tends to build on previous work, the cited publications will often include some of the authors' own publications. The nature and scope of the problem to be investigated is presented, in particular how the study fills gaps in knowledge that need to be filled for the field to advance. Typically, readers rely on the author to do a competent job in telling it like it is, but authors vary in their competency, so misleading information can end up being published. Moreover, authors have an axe to grind—they may indicate a greater need to fill some gaps in knowledge than would be generally accepted by other investigators.

The introduction will also indicate the approach or methods used in the investigation and the principal results and conclusions. In order to facilitate reading by a wide audience, the authors may introduce and define specialized terms and abbreviations that will be used in the paper. Most importantly, the introduction must capture and keep potential readers’ attention, for if their attention is lost early it will likely never be regained. The introduction also often contains the most significant conclusion of the paper so that the reader can judge the evidence that will be presented in context.

Materials and methods

Purpose

The purpose of the materials and methods section is to provide enough details and references to enable a competent scientist to repeat the work and to evaluate the adequacy of the methods. In actual practice, few studies are precisely replicated, and a worrying development is that a large proportion of studies published in high-impact journals could not be repeated when investigators attempted to do so. Prinz et al11 concluded that "the literature data on potential drug targets should be viewed with caution and underline the importance of confirmatory validation studies." They further point out some reasons for the lack of reproducibility, such as negligence in control over or reporting of experimental conditions, including insufficient description of materials and methods. The materials and methods section also gives authors a forum to show that they used appropriate methodology—that is, methods that enable their study to address the research problem—and may also indicate the care with which experiments were done. Vague or inadequate description of methodology has been found to be one of the most common criticisms of manuscripts by referees. Similarly, poor methodology yielding potentially faulty results is listed among editors' most common criticisms of manuscripts.12 The methods section has been identified as the section of a paper that usually contains the most flaws, and failure to give a detailed explanation of the design is among the top three problems most often responsible for outright rejection of a paper.13 Clearly the methods section attracts reviewer attention and therefore should be prepared carefully.

Structure

It is useful to provide a framework for the reader at the beginning, and this may be done with an overview sentence of the investigational design, possibly aided by a diagram. In general, I think a diagram that incorporates groups, randomization, timing of procedures and observations, and outcome variables is particularly helpful for readers of clinical studies. A number of conventions for outlining investigational designs have been developed (see, for example, that of Campbell and Stanley,14 which I have utilized in the previous editions15 as well as the current one). The materials and methods section often employs subheadings extensively. This enables the reader to home in easily on the desired information, such as reading about an assay method without having to read about other methods that are of less interest to a particular reader.
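
For example, in the notation of Campbell and Stanley14 (R = random assignment to a group, O = an observation or measurement, X = exposure to the treatment, with time reading from left to right), their pretest-posttest control group design can be diagrammed as:

    R   O1   X   O2
    R   O3        O4

A reader can see at a glance that both groups were randomized and measured at the same times and that only the first group received the treatment.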

Content

Content obviously varies by the type of study, and there are varying traditions on how methods are reported. Perhaps the easiest approach to determining the level of detail is to examine publications in the journal to which you will be submitting the paper. For example, my lab's research falls into the category of quantitative experimental research. I advise my students to look at past publications from our lab to see the level of detail in which materials and methods were described. For example, the quantities, sources, or methods of preparation for materials are given, and generic names are used rather than proprietary or trade names, as trade names can vary depending on country of origin. While new methods or modifications of existing methods must be described in detail, standard, well-accepted methods are usually described simply by a citation (eg, protein determined by the method of Lowry et al—the all-time leader in number of citations collected). There are some standards for presentation in the biologic sciences, such as MIBBI (minimum information for biologic and biomedical investigations), which includes reporting guidelines for such approaches as microarrays, cellular assays, etc (see fairsharing.org). For studies involving humans, the common practice is to describe how the population was selected (exclusion and inclusion criteria) and the characteristics of the sample (age and sex distribution, etc). The institutional review board that approved the study should be stated.

For studies involving animals, the source should be listed, as well as the special characteristics of the animals employed in the study (eg, age, sex, housing conditions, and diet). For clinical research, such information as whether the assessors, patients, and statistician were blinded to the group membership of the patients is also required. The components of this book that are of special importance to the materials and methods section include measurements and their errors (see chapters 12, 13, and 15) and investigational strategy, tactics, and design (see chapters 16 to 20). Aspects of measurement include operational definition, precision, accuracy, validity, and reliability. These topics, as well as investigational strategies and experimental design, will be discussed later. In general, each strategy is associated with characteristic strengths and weaknesses, and knowing these enables a researcher to criticize papers more efficiently. For clinical research, there is fairly widespread acceptance of the concept of a hierarchy of evidence. One example from the United States is given in Table 1-1.16 Other hierarchies place meta-analysis of RCTs at the highest level and may be further divided according to the quality of the meta-analysis. Commonly, materials and methods sections appear to have been treated somewhat casually by authors and indeed by journals, where in some instances insufficient space was made available for methods to be described precisely. The "reproducibility crisis" (discussed in chapter 2) has resulted in some journals allocating more space to this important aspect of science.

Results

This section presents the evidence for the conclusions. The observations and experiments are often presented in an order and a form that is designed to convince the reader of the truth of the paper's conclusions. The chapter on presentation of results (chapter 14) shows how data, figures, and tables may be manipulated to further the rhetorical purposes of an author. Tufte17 states that the standard of evidence can be judged on the criteria of integrity, quality, and relevance.

Integrity

The extent of scientific fraud such as fabrication, plagiarism, and falsification is not known with any precision but appears to be sufficiently problematic that institutions and granting agencies have established procedures for reporting and investigating breaches of research integrity. The US National Institutes of Health's policies are built on four shared values in scientific research: (1) honesty (convey information truthfully), (2) accuracy (report findings precisely, with care being taken to avoid errors), (3) efficiency (avoid waste), and (4) objectivity (let the facts speak for themselves, no improper bias).18

Quality

This includes such factors as technical quality (such as whether micrographs are in focus).

Relevance

Relevance entails presenting information in context. Key issues are what comparisons are made and how they are made. The main worry is that authors will make choices, such as manipulating scales or suppressing the baseline to enhance the apparent size of an effect, presenting data as ratios to suppress variation, or even omitting essential data, so that the reader is misled (see chapter 14).

Discussion/Conclusions

Purpose

The purpose of the discussion and conclusion section is to summarize the major findings and explore the relationship between the findings of the study and the facts reported by others (ie, the published literature). The discussion contains the logical arguments that link the data in the results section, as well as the work of other investigators, to the conclusions. In general, the arguments tend to be comparative and buttressed with statistical analysis to demonstrate differences: present results versus results of earlier investigations, values in treated groups versus control groups, observed results versus those predicted by theory, comparison of values over time, and so forth. Although the overall aim of the discussion is to provide compelling support for the conclusion, the literature cited in the discussion should be balanced, that is, include papers that might cast doubt on the conclusion (if such exist). As medieval philosophers were aware, "a difference requires a distinction." Thus explaining differences in findings often entails noting differences in conditions, models, or techniques that can explain the discrepant observations of others. It is also important to indicate the significance of the findings, such as how they bear on the theory or practice of the discipline. This aspect is important in spiking the guns of nasty critics who might ask "so what?" after reading the paper. Not uncommonly, the discussion incorporates suggestions for future experiments that would expand understanding of the phenomenon or resolve discrepancies.

The discussion should also promote the advantages and acknowledge the limitations of the study. What should be included among the advantages and limitations can often be guided by the strengths and weaknesses of the study design. For example, survey research can suffer from the threat to validity of selection. Investigators send out survey forms with the intent of learning about a target population. Not all the people who receive the surveys return them, so there is a well-known possibility of response bias, that is, that the sample obtained is not representative of the target population. As most surveys have the potential for this problem, in all likelihood it will have to be discussed. Indeed, in most instances the knowledgeable reader would expect to see the issue of the representativeness of the sample discussed. Meeting reader expectations by discussing anticipated objections makes the author more credible and thus the paper more persuasive. On the other hand, just as writers should be forthright in discussing limitations, so they should not hesitate to point out their studies' strengths. Some readers might not be aware of the advantages of a particular research design and need education on some points. For example, a study using a factorial design has the ability to assess the interaction between factors, but this advantage appears to be unfamiliar even to some investigators, who conduct experiments that could be analyzed with the two-way analysis of variance appropriate for a factorial design but fail to do so. The discussion section is typically where rhetoric is most overtly practiced, as the importance of some observations is emphasized while that of others is downplayed or even ignored. An editorial in the British Medical Journal noted that the editors see many papers in which the purpose of the discussion seems to be to "sell" the paper. They proposed a "structured discussion," analogous to the structured abstract, that follows this sequence: (1) statement of principal findings; (2) strengths and weaknesses of the study; (3) strengths and weaknesses in relation to other studies, discussing particularly any differences in results; (4) meaning of the study—possible mechanisms and implications for clinicians or policymakers; and (5) unanswered questions and future research.19

References

1. Penrose AM, Katz SB. Writing in the Sciences: Exploring Conventions of Scientific Discourse. New York: Longman, 2010:93.
2. Huth EJ. How to Write and Publish Papers in the Medical Sciences. Philadelphia: ISI, 1982.
3. Watson JD, Crick FH. Molecular structure of nucleic acids; A structure for deoxyribose nucleic acid. Nature 1953;171:737–738.
4. Judson HF. The Eighth Day of Creation, ed 2. Cold Spring Harbor: Cold Spring Harbor Laboratory, 1996.
5. Tacker MM. Parts of the research report: The title. Int J Prosthodont 1990;3:396.
6. International Committee of Medical Journal Editors. http://www.icmje.org. Accessed 23 July 2018.
7. Primack RB, Cigliano JA, Parsons ECM. Coauthors gone bad; How to avoid publishing conflict and a proposed agreement for coauthor teams. Biol Conserv 2014;176:277–280.
8. Sackett DL. Evaluation: Requirements for clinical application. In: Warren KS (ed). Coping with the Biomedical Literature. New York: Praeger, 1981:123.
9. Tacker MM. Parts of the research report: The abstract. Int J Prosthodont 1990;3:499.
10. Fisher A. The Logic of Real Arguments. Cambridge: Cambridge University, 1988.
11. Prinz F, Schlange T, Asadullah K. Believe it or not: How much can we rely on published data on potential drug targets? Nat Rev Drug Discov 2011;10:712.
12. Byrne DW. Avoiding common criticisms. In: Byrne DW (ed). Publishing Your Medical Research Paper: What They Don't Teach in Medical School. Baltimore: Williams & Wilkins, 1998:49.
13. Byrne DW. Preparing to write a publishable paper. In: Byrne DW (ed). Publishing Your Medical Research Paper: What They Don't Teach in Medical School. Baltimore: Williams & Wilkins, 1998:51–64.
14. Campbell DT, Stanley JC. Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally College, 1963.
15. Brunette DM, Hornby K, Oakley C. Critical Thinking: Understanding and Evaluating Dental Research, ed 2. Chicago: Quintessence, 2007.
16. Kroke A, Boeing H, Rossnagel K, Willich SN. History of the concept of "levels of evidence" and their current status in relation to primary prevention through lifestyle interventions. Public Health Nutr 2004;7:279–284.
17. Tufte ER. Beautiful Evidence. Cheshire, CT: Graphics, 2006:9.
18. National Institutes of Health. Research integrity. https://grants.nih.gov/policy/research_integrity/what-is.htm. Accessed 23 July 2018.
19. Docherty M, Smith R. The case for structuring the discussion of scientific papers. BMJ 1999;318:1224–1225.


4 Rhetoric



"Science, then, necessarily involves rhetoric. And it also places scientists in what Burke calls 'the human barnyard' where motives are never altogether pure and language must dramatize the inevitable ambiguity of motives."

S. MICHAEL HALLORAN1

To publish in good journals, authors must convince referees and editors that their work is new, true, and important. To get work funded, investigators must persuade committees of their peers that the work is feasible, significant, and novel. Often, the criteria for such things as novelty and significance are not straightforward; the writer of the grant or paper must pose arguments for these qualities in the most persuasive way possible.

Rhetoric, defined by Aristotle circa 323 BC as the faculty of discovering all the available means of persuasion in a given situation, doubtless has been practiced since before recorded time, but it played a particularly important role in ancient Greece and Rome. Two excellent texts on classical rhetoric that have been used extensively in preparing this section are those of Habinek2 and Corbett and Connors.3 Rhetoric was central to decision making in the classical world, and instructors in rhetorical techniques flourished. Indeed, handbooks of rhetoric are perhaps the best documented genre of ancient writings.4 Civic discussion took place in public forums, and citizens voted on a wide variety of issues, from state decisions, such as going to war or forming alliances, to criminal cases and property disputes. (Only male citizens participated in these discussions, while women, slaves, and outsiders were excluded.5) As these assemblies heard arguments, rhetoric focused to some degree on how the arguments sounded. Key elements of rhetoric were memorization and oration. Lists of rhetorical terms6 include devices that add flourish to delivery rather than clarity to argument. When Vice President Spiro T. Agnew railed against the "nattering nabobs of negativism," he employed alliteration and onomatopoeia. A country-and-western singer might employ syncrisis when he croons, "Pick me up on your way down." Such techniques find little application in scientific writing.

Nevertheless, rhetoric plays a large, if sometimes subtle, role in science and has been the subject of close investigation.7 As noted by an editor of Lancet, a scientific paper is an exercise in persuasion,8 and it follows that readers of this literature should learn the tricks of the persuasive trade. One problem with rhetoric for scientists, who are presumably interested in determining the true nature of things, was understood by Socrates and Plato: there is a fundamental dichotomy between the aims of the speaker and the aims of the audience. The speaker's goal is to persuade; the audience's goal is truth.8 Plato in particular has been described as being no friend of rhetoric.4 Plato's dialogue Gorgias states, "The orator need have no knowledge of the truth about things; that it is enough for him to have discovered a knack of convincing the ignorant that he knows more than the experts."9

Scientists have traditionally been wary of rhetoric even as they practice it. "Scientific heads are not turned by rhetoric" is the title of an article by a prominent proponent of evidence-based medicine.10 However, rhetorical techniques and methods of influence abound in scientific articles and in the social interactions of scientists. A sensitive reader, particularly one who believes in Merton's norms for scientists and the storybook version of science, might be offended by some of the information that follows. My approach here is descriptive rather than prescriptive; I am not recommending any particular technique. The pragmatic point is that readers of the literature should know how to recognize rhetoric in practice and, on occasion, fight through the rhetoric to determine the soundness of the conclusions.

Some of the ancients' rhetorical techniques are employed commonly in science in what may be called "classical ways." One weakness in classical rhetoric, at least from a modern viewpoint, is the reliance on commonplaces, which were general arguments believed by much of the populace that could be used on many occasions. Examples include "death is common to all" (used in praising fallen soldiers), "time flies" (when there is a need for urgency), or "death before dishonor" (when some might contemplate surrender). Modern persuasive techniques make less use of classical commonplaces, probably because modern audiences do not find them convincing.4 Moreover, scientific arguments tend to be specific. Modern persuasive techniques have also been developed as a result of psychologic and sociologic scientific investigation, rather than the rough-and-ready empiricism of the ancients, who might judge the effectiveness of a speaker by the crowd he generated either at the door of the Senate or in the marketplace when it was his turn at the rostrum. Nevertheless, persuasive techniques used in science are best considered against a background of the classical methods, as well as modern psychology.

Classical Rhetoric

Five canons of rhetoric

Aristotle proposed five canons of rhetoric, which are named here in Latin:

1. inventio: methods for discovering arguments
2. dispositio: methods for the effective and orderly arrangement of the parts
3. elocutio: style, "proper words in proper places"
4. memoria: memorizing speeches
5. pronuntiatio: delivery of speeches

The latter two canons are not relevant to analyzing scientific literature but are relevant to presentations at scientific meetings.

Inventio

Inventio refers to a system for finding arguments. In theory, a well-trained orator could speak on any issue and on any side of an issue, but he needed raw material: the facts (or disputed facts) of the case, general principles applying to a situation, and convincing approaches to present the argument. Aristotle recognized two kinds of topics. The first—those useful in a special area of knowledge or arising from the particular circumstances—were called idioi topoi in Greek. In a judicial setting, for instance, Aristotle named five kinds of proofs: laws, witnesses, contracts, tortures, and oaths. Similarly, a scientific paper will have data specific to the investigation. A study in clinical periodontics might include data on the number and characteristics of the patients, the procedures used, and the outcomes of various measurements—such as attachment level—at various times. These types of topics constitute the raw material of the paper.

Aristotle's second group of topics included those useful in arguments of all kinds—the so-called common topics (koinoi topoi). Aristotle named four common topics: (1) more or less (the topic of degree), as occurs in debates on any situation (eg, fluoride) where a recommended dosage is suggested; (2) possible and impossible, which can occur with respect to compliance with some theory; (3) past and future, as occurs in some epidemiologic questions; and (4) greatness and smallness (the topic of size), which addresses the common concern of the size of a treatment effect. Aristotle also listed 28 topics that might be considered as techniques to improve or support arguments.6

An author may want to redefine a key term to support a contention or to define terms to favor the argument. (For example, in the field of dental implants, there was considerable debate over what criteria should be used to define a successful implant.) In making his list, Aristotle began the tradition of philosophers making lists of good or fallacious arguments. Similarly, the gloomy philosopher Schopenhauer prepared a list of 38 ways to win an argument,11 12 of which appear morally dubious and are the kinds of techniques that might be used by academic administrators who believe that the ends justify the means. To make a scientific paper persuasive, however, authors have to employ persuasive techniques. Aristotle listed three main types of proof.6

Logos or rational appeal

The arguments in a scientific paper could include both deductive and inductive logical approaches. Typically, deduction is certain and makes conclusions from statements, while induction makes inferences from verifiable phenomena. The enthymeme is a deductive syllogism in which a premise is missing. In rhetoric, Aristotle favored the use of the enthymeme when an argument in syllogistic form has a premise, belief, or value that the rhetor thinks is shared by the audience.12 In a scientific paper, the deductive approach often occurs when the authors predict the consequences of their hypothesis. For inductive logic, Aristotle used the example, which can be provided to show the probability of a statement being true but is always subject to challenge and refutation. In the periodontics example, if a difference were observed between two groups treated in different ways, the authors would be required to show through appropriate statistical testing that the difference could not be explained by chance. The rational appeal is central to all scientific papers.

Pathos or emotional appeal

Aristotle did not necessarily approve of the use of the emotional appeal, but he did know that it worked. Use of emotional arguments in scientific papers is rare; one of the norms of science is disinterestedness. But pathos does feature in discussions of science policy. Waddell13 has written on the role of pathos in the decision-making process and has considered in detail the debate over the use of recombinant DNA in the early days of this technology in Cambridge, Massachusetts. The proponents' arguments included the proposition, "If your child were suffering from these diseases [diseases that might be solved by advances in genetic knowledge], you would want help from those who could help." The opponents argued, "Violating three billion years of evolution is dangerous." These views are a long way from dispassionate decision-making algorithms. Similarly, emotionally charged debates occur today on the issue of embryonic stem cells.

Emotion also surfaces in disputes over priority and funding. Multidisciplinary collaborative research has many virtues, but a vice is that the principal investigator, who controls the funds, may be tempted to direct funds disproportionately to his or her own projects. Funding decisions made in committees are supposed to be confidential, but sometimes information leaks out of the committees and supplies victims of bad reviews with motivation to inflict similar pain on their tormentors. In such cases, the motivation of pathos is not mentioned, and the criticisms are cloaked in the objective garments of logos. The effectiveness of pathos in persuading juries in personal injury torts is legendary. Melvin Belli, the King of Torts, dropped an artificial limb into a juror's lap on behalf of a client who had lost her leg in a streetcar accident. Such demonstrative evidence was designed to help the jury understand in graphic terms the consequences of losing a limb.

Ethos or ethical appeal

Ethos or ethical appeal stems from the character of the speaker as shown in the speech (or paper) itself. Aristotle thought that this could be the most potent of the three modes of persuasion. The recommended procedure involved several steps: (1) ingratiate oneself to the audience to gain trust, (2) show intelligence, (3) demonstrate benevolence, and (4) exhibit probity. Aristotle commented that persuasion through character occurs "whenever the speech is put in such a way as to make the speaker worthy of credence . . . And this should result from the speech not from a previous opinion that the speaker is a certain kind of person."3 That may have been true for his fair-minded Greek audience, but it would appear to be dubious in modern-day discussions. For example, when a professor rises to take part in a debate in a faculty council, the members of the council have already formed a judgment about the professor's motives, the causes he or she has espoused in the past, and the outcomes of decisions that he or she advocated. These considerations cannot be erased simply. Indeed, Roman practitioners of rhetoric taught that a speaker's credibility and the weight of his words were intertwined not only with the current speaker but also with his family history.12

That tendency survives today; Canada's current prime minister is the son of a previous prime minister, who in the eyes of some disadvantaged the oil-producing provinces and incurred record debt, forcing his successors to adopt austerity measures. Critics note the same failings in his son.

Box 4-1 | Medieval versus contemporary student rhetoric

Letter from rhetoric student B to his venerable master A, Oxford ca 1220.18
"This is to inform you that I am studying at Oxford with the greatest diligence, but the matter of money stands greatly in the way of my promotion . . . the city is expensive and makes many demands; I have to rent lodgings, buy necessaries, and provide for many other things which I cannot now specify. Wherefore I respectfully beg your paternity that by the promptings of divine pity you may assist me so that I may complete what I have well begun. For you must know that without Ceres [goddess of corn] and Bacchus [god of wine], Apollo [sun god, poet] grows cold."

Email from law student Maxwell Brunette to his father, Saskatoon 2001
"If I maintain my average I will graduate with distinction . . . I have looked at my projected finances and debt position for the next 6 months and have realized that I will need help. . . . I will have a number of fairly large expenses to meet, such as rent deposits, rent, buying such necessities as a bed, microwave etc. . . . [please send] an extra $500 included in the check that will pay for the repairs to the car."

In science and scientific writing, ethos is established primarily by conforming with the norms of scientific behavior outlined earlier: universalism, communalism, humility, originality, organized skepticism, and disinterestedness. It has been said that scientific writing style capitalizes on the convenient myth that reason has subjugated the passions. Objectivity and disinterestedness are indicated by the use of the passive voice. Ethos is further established by the choice of methods and interpretive structure; choice of sound methods indicates a sound investigator. Finally, ethos is also indicated by the choice of literature citations. To a large degree, because negative citations are rare, the use of a citation implicitly indicates that the cited work is accepted and is important to the current work—otherwise why cite it? As recognition is the coin of science, citations are like little gifts to the author and are thus part of the scientific reward system.14 Conversely, failure to cite someone could be construed as an insult and could be rhetorically damaging. A book on writing grant applications suggests that applicants find out the names of grant committee members and cite them, provided such citations do not appear contrived.15 Regardless of the motivation, citations to the work of others establish the author as a fair-minded person.

A concern with ethos also characterizes modern writers of scientific articles. In an essay on the birth of molecular biology, Halloran16 analyzed the rhetoric of Watson and Crick's 1953 brief paper that was the first public announcement of their work on the structure of DNA. Halloran describes the most striking feature of the paper as its genteel tone and delicate approach. Watson and Crick believed that Linus Pauling had made an incredible blunder in his model but expressed the problem gently: "Without the acidic hydrogen atoms it is not clear what forces would hold the structure together." Halloran then notes that the ethos employed in the article shaped a particular image of the scientist speaking "within a broader set of more vague and general norms that apply to all scientific discourse."

In everyday life, ethos or communicator credibility is seen as a combination of expertise and trustworthiness. Peripheral characteristics, such as outward appearance, also play an important role. Arguing against their own self-interest, as scientists sometimes do in testing hypotheses, increases trustworthiness. The degree to which audience opinion can change depends on communicator credibility; a trusted communicator can achieve greater movement, while those with doubtful credibility can hope only for modest change.17 Credibility in science is often judged informally by such things as the number, source, and amount of grants and papers published in leading journals, as well as awards.

The Roman orator Cicero summed up the classical approach in his advice to orators: "Charm [ie, credibility, ethos], Teach [ie, sound message, logos], and Move [ie, call to action through emotion, pathos]." For many years, these steps have been effective. As an example, we will consider two letters from students to their fathers asking for funds to continue their studies (Box 4-1). The first example comes from Oxford in the 13th century, that is, some 1,600 years after Aristotle.

As background, the seven liberal arts of the Middle Ages were divided into two segments: the most basic required knowledge of the trivium (hence the word trivial), comprising grammar, rhetoric, and logic, and the more advanced requirements called for understanding the quadrivium, comprising arithmetic, astronomy, geometry, and music. It may seem strange that rhetoric precedes arithmetic in the curriculum, but, to continue their studies, students frequently had to convince their patrons to send money. Thus, most of the extant examples of medieval rhetoric are letters to parents or other sponsors requesting funds. These letters could show considerable sophistication; one exercise showed 19 ways to approach an archdeacon for funds. Thus, modern manuals for grant proposals descend from an ancient line.

The letter from the medieval rhetoric student first establishes his ethos ("I am studying with the greatest diligence"); he then makes the rational appeal of logos by enumerating his expenses. Then he moves to the call to action through pathos, for, without anything to eat and drink, his intellectual work must fail.

My son Max's email, some 700 years after the Oxford student and some 2,300 years after Aristotle, is remarkably similar. Ethos is established by the statement that if he maintains his average, he will graduate with distinction. Note that this ethos is specific to this communication. The reader, his father, is supposed to forget other times when his behavior was not so praiseworthy. (He finds no need to mention, for example, the grade school incident in which he annoyed the French teacher by throwing his textbook out the window; such embarrassing events, which might discredit the ethos of the current scholar, are best left out of the discussion.) As in the Oxford student's letter, logos is represented by the enumeration of expenses. Also, like the Oxford student who must "provide for many other things which I cannot now specify," Max leaves himself some "wiggle room" by including the "etc" among the list of expenses. Neither student wants to specify costs too precisely. In Max's case, the motivational portion of Cicero's dictum—that is, the call to action—came in a supplementary telephone call that resulted in funds being dispatched from Vancouver to Saskatoon. (To be fair, the expenditure was worthwhile; Max did graduate from law school with distinction.) This example indicates that the classical approach to persuasion continues to be effective.

Dispositio

Dispositio concerns the orderly and effective arrangement of arguments. Scientific writing tends to be logical; indeed, one authority believes that a scientific paper is primarily an exercise in organization. The format of the typical scientific paper,20 IMRaD (Introduction, Methods, Results, and Discussion), is analogous to Aristotle's approach of introduction, narration, proof, and epilogue.8 While orators employ a number of organizational schemes,21 the most common scheme used in scientific presentations is the zoom-in . . . zoom-out (zizo) approach, whereby the author reviews the general area of research in the introduction and "zooms in" to the specific problem studied. In the discussion section, the author "zooms out" to show how his or her research has advanced the field. The book Dazzle 'Em with Style22—intended as advice for young scientists—recommends only the zizo approach for giving seminars. However, a more detailed look at the rhetorical strategies used in the introductions of papers finds that scientists typically make four moves: (1) create interest, (2) review history, (3) show a gap, and (4) introduce new research.23 Nevertheless, the specific purposes of some papers are better served by other approaches, some of which are covered by Spicer.21 A chronologic approach, outlining how a procedure has developed over time, can be effective in reviewing a topic. A dialectic plan, comparing two competing hypotheses and how they are to be resolved by the current study, can also be effective, particularly if one has been able to design a crucial experiment. The facet approach, whereby the author considers diverse aspects of a given problem, was applied to great effect in a seminar describing periodontal disease given by my former dean, Paul Robertson, when he applied for the deanship of dentistry at the University of British Columbia. Robertson used the allegory of a dragon to symbolize aspects of the diseases (eg, the fiery breath of the dragon represented the heat of inflamed tissues). That I remember that seminar given over 40 years ago demonstrates that the confidence to apply novel approaches can make a speaker stand out from the pedestrian crowd, who cling resolutely to the single organizational plan of zooming in and zooming out.

Elocutio

Elocutio is described as putting proper words in proper places, but communicator style involves more than word choice. Scientific authors attempt to convey their objectivity through word choice. Some of the issues of word choice in science are similar to those in writing generally; thus, the same principles found in reading critically apply to reading scientific writing.23 Accuracy is crucial in scientific word choice; terms used must conform with the rest of the scientific literature or must be defined explicitly. Precise language use avoids the problem in logic called the fallacy of equivocation. Moreover, scientists are typically conservative in their claims. Combined with the need for accuracy, this conservatism, as noted earlier, leads to conclusions that are qualified with hedge terms or phrases. Many authors finding that some factor X affects some biologic activity Y but lacking any concrete evidence for the actual mechanism(s) will be hesitant to write, "X regulates Y." Rather, scientists will write something like, "X plays a role in regulating Y," a phrase that could be interpreted as, "I don't know how X works, but it is involved somehow."

The problem of presenting evidence and indicating its strength is of particular interest in science. Because all studies build on the foundation of other studies, citations play a key role. Latour and Woolgar19 developed a scale to assess the strength of evidence by how it is cited. The strongest evidence, in their view, is so self-evident or well known that it needs no citation (eg, the chemical formula for water is H2O). Using an author's name to identify the study constitutes a stronger endorsement of the evidence than simply citing the study, and, of course, qualifying the findings of a study indicates the study's lack of strength or generalizability.

For scientific writers, however, the question of how much to qualify their own results is a particularly thorny problem. Claiming too much invites referees to point out that the claims exceed the evidence, but claiming too little may lead them to ask, "So what? How has this paper advanced science?" An aggressive strategy is to state claims more strongly than the presented evidence supports and to cut them back in response to referees' criticisms. Consequently, readers of scientific papers should look closely at qualifying statements or hedges, which may reveal referees' concerns that likely are also the concerns of the reader. Serving on a grants panel, I was once the primary reviewer of an application that proposed the use of a model system to investigate a process that occurred in vivo. However, one of the applicant's previous papers contained a qualifying sentence stating that the in vitro model system would be a very poor model of what would occur in vivo. Thus, this qualifying sentence undermined the premise of the proposed research program. Given the applicant's intention to build future research on this in vitro system, it is unlikely that the qualifying sentence was included in the paper initially; rather, it was probably a hedge inserted to meet referees' concerns.

A common defensive strategy is to use the modesty trope, a conventional disclaimer to preempt criticism. Users of an in vitro system might write something like, "Further experiments using animals are required to validate the mechanisms being proposed as a result of these in vitro studies."

Books on scientific writing often contain lists and rules for avoiding the most common lapses of style. According to these lists, wordiness and jargon are the chief problems, and they recommend replacing phrases like "a majority of" with "most" or rephrasing "in my opinion, it is not an unjustifiable assumption that" as "I think." One of the most interesting and effective approaches to sentence construction is that of Gopen and Swan,24 who advocate locating sentence elements according to the common expectations of readers.

Audience

Most of the discussion of classical rhetoric presented above is given from the perspective of what the rhetor does; the audience has not been specifically considered. Classical rhetors knew the importance of tailoring their speeches to their listeners, but they lacked the scientific approach of controlling stimuli and quantifying responses that characterizes modern psychology. Moreover, modern studies of persuasion consider a broader range of influences, such as radio, television, and social media, than were considered by classical authorities. Scientific discourse occurs over a wide array of forums where persuasive techniques play a key role in decision making, including editorial boards for scientific publications, grants panels, science policy debates, and university committees. The formalism present in the IMRaD model used by scientific papers is absent in many of these situations. Scientists in these modern forums persuade and are persuaded by the approaches and techniques used in everyday discourse.

Choosing the audience

One advantage of scientists is that they can, to a certain extent, choose their audiences by submitting their work to journals or granting agencies that they select themselves. Learning the names of editorial board or grants panel members helps scientists determine where to send their work. Even the most fair-minded scientists would favor sending their work to places where it might be appreciated. To help this process, a scientist might suggest to the editor that certain reviewers are uniquely well qualified to review the work. Authors should refrain from being too obvious in advocating close friends and colleagues; referees should be at arm's length (sometimes interpreted as not having published with the author for at least 5 years). Conversely, the author can mention that certain persons would not be appropriate because of personal or professional differences, but this is best done before the fact; correcting such a situation afterwards is exceedingly difficult (particularly as it is often difficult to be sure of the identities of the anonymous referees). In his Advice to a Young Scientist, Medawar25 notes the following:

    There are times when referees are inimical for personal reasons and enjoy causing the discomfiture that rejection brings with it; too strenuous an attempt to convince an editor that this is so may, however, convince him only that the author has paranoid tendencies.

Abelson's "MAGIC" Criteria for Persuasive Force

R. P. Abelson, a specialist in psychologic statistics, has stated that the purpose of statistics is to organize a useful argument from quantitative evidence using a form of principled rhetoric. He proposed that several properties of data, as well as their analysis and presentation, govern their persuasive force. The first letters of these properties form the acronym MAGIC: magnitude, articulation, generality, interestingness, and credibility.26 I believe these criteria apply generally to all types of experimental research.

Magnitude refers to the size of the observed effect. Often investigators wish to compare two means. Frequently this comparison is made by subtracting one mean from the other to give a statement such as, "Pocket depth was reduced 1 mm by rinsing daily with chlorhexidine." Such a difference, however, must be interpreted in the context of the variability of the data, and effect-size indicators have been developed for various applications (see chapter 21). Magnitude is also sometimes estimated by the ratio of means between the treated and control groups. This calculation gives rise to statements about the relative increase or decrease caused by a treatment, often described by such terms as "twofold increase" or "50% inhibition." Another example of ratios occurs in risk assessment, such as relative risk and the odds ratio. Scientists' concern with magnitude is necessary to complement the other requirement for comparisons between groups—namely, that the difference is unlikely to be caused by chance. When, however, the number in the groups is large, the statistical tests become very powerful, so that they can identify tiny effects. That is, a statistically significant difference may be demonstrated for some treatment, but the size of the difference is so small that the treatment does not have a meaningful effect. It is evident that scientists find magnitude a convincing argument, given the widespread manipulation of scales to produce larger apparent effects in figures. Because this problem is so prevalent, Tufte27 has proposed an index called the lie factor, which is the ratio between the apparent visual effect seen in the figure and the actual numeric effect size (see chapter 14).

Abelson26 defines articulation as the degree of comprehensible detail in which conclusions are phrased. A well-articulated publication, in Abelson's view, tells readers what they ought to know to understand the point of the study and what the results were. Many papers, to their detriment, fail to measure up to the criterion of articulation because they end on a whimper, not a bang; they seem to end when the author runs out of words, rather than with a strong conclusion. I think articulation can also be considered the degree to which claims are supported. Typically, papers in higher-impact journals offer more evidence—and more types of evidence—to support their claims than those in lower-impact journals. In a study on cell signaling, one might consider publishing the results of immunostaining on one component of a signaling pathway, but that strategy likely would not be effective in getting the study published in a high-impact journal. Such a journal might require more types of evidence to support the claim, such as gene knockout studies, Western blots to show differences in specific protein levels, inhibitor studies, or immunostaining of other components in the relevant pathways.

Generality refers to the breadth of applicability of the conclusions. The conclusions of any research study are strictly limited to its particular conditions, but studies that can be applied widely, at least in theory, will have more impact. This happens quite naturally, because a prime reason scientists read the scientific literature is to apply concepts or techniques to their own studies. A study using conditions that are difficult to replicate or that apply only to a very limited population will often have little impact, because other scientists cannot apply it to their own work.

Interestingness, Abelson26 believes, indicates that the study has the potential to change what people believe about an important issue. An interesting result is often one that is surprising, and, because it unsettles previously held concepts, it makes the reader think about the topic. However, it can be risky to be too interesting. The principle of least difference holds that the master craftsman (or, in this case, scientist) demonstrates something that is sufficiently different from its predecessors to be considered distinctive, but the difference is sufficiently small that it does not imply criticism of what has preceded.

Credibility refers to the overall believability of the conclusions, and it rests both on the soundness of the methodology and on how well the results agree with, or at least do not overtly contradict, other well-established studies in the same area. Abelson26 states that a research claim or new theory that contradicts a prevailing theory or common sense can expect attacks on two fronts: (1) methodology, such as data analysis or technique, and (2) coherence, that is, the question of whether a new theory can explain a range of interconnected findings. The burden of proof will rest with the proponent of the new theory.
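
To make the magnitude criterion concrete, here is a minimal sketch in Python (every number is hypothetical) of two calculations discussed above: how a very large sample can make a clinically trivial difference statistically significant, and Tufte's lie factor for a graph drawn with a suppressed baseline:

    import math

    # --- Statistical significance versus magnitude (hypothetical numbers) ---
    n = 10000                            # patients per group
    mean_ctrl, mean_test = 3.00, 3.05    # mean pocket depth (mm); difference = 0.05 mm
    sd = 0.9                             # common standard deviation

    se = sd * math.sqrt(2 / n)           # standard error of the difference in means
    z = (mean_test - mean_ctrl) / se     # test statistic: ~3.93, so P < .001
    d = (mean_test - mean_ctrl) / sd     # Cohen's d, a common effect-size indicator: ~0.06
    print(f"z = {z:.2f}, d = {d:.2f}")   # "significant," yet the effect is negligible

    # --- Tufte's lie factor for a truncated y-axis ---
    # The data rise from 100 to 110 (a 10% effect), but bars drawn from a
    # suppressed baseline of 95 have heights 5 and 15 (a 200% visual effect).
    data_effect = (110 - 100) / 100
    visual_effect = (15 - 5) / 5
    print(f"lie factor = {visual_effect / data_effect:.0f}")  # 20: a 20-fold exaggeration

The first calculation shows why a P value alone says nothing about whether a treatment matters; the second shows how a figure can claim a magnitude the data do not possess.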

The Persuasion Palette

The process of scientific persuasion is somewhat like the production of a work of art. In painting, the artist uses a variety of colors to produce a unified whole. The particular colors chosen vary among paintings according to the desired effect; Mondrian's paintings blaze with primary colors, whereas Rembrandt's paintings tend to be symphonies of blended earth tones, but both painters produced works that convinced critics to consider them masterpieces. Similarly, scientists employ various persuasive techniques depending on the desired effect. In choosing among rhetorical options, the scientist can be thought of as using a palette of persuasive techniques—the persuasion palette. The persuasion palette includes various theoretical principles and empirically based observations that can be mixed and matched to produce an overall persuasive effect. The components to be discussed include peripheral and central routes to persuasion, the principal principle, the theory of cognitive dissonance, the law of cognitive response, the cognitive miser concept, and Cialdini's seven principles of influence.

Two routes to persuasion

Excellent overviews of persuasion have been published by Pratkanis and Aronson,17 Cialdini,28 and Simons,29 to which the reader is referred for more detailed information. The following sections are largely based on those sources, supplemented with more specific examples of how persuasive techniques are applied in the scientific community. Modern psychology has determined that there are two routes to persuasion.29,30

1. In the peripheral route, the recipient extends little effort or attention, ie, System 1 thinking.30 When the audience operates in this mode, persuasive factors such as the attractiveness of the presenter and the presentation of any reason, however bogus, may actually work. Dental conferences, for example, typically have exhibitions directly or indirectly concerned with selling goods; the people working in the exhibitors' booths often possess more pulchritude than the attendees wandering the aisles. How much attention will a dentist pay to the selection of toys for his "treasure chest" for young clients? An attractive salesperson may tip the balance from one supplier to another.

2. In the central route, ie, System 2 thinking,30 the recipient engages in careful consideration of the true merits of the case. The recipient may argue against the message, ask questions, and seek out other sources of information. For example, in buying a digital x-ray unit, a dentist might approach various companies, check with other dentists, and closely calculate the costs and the pros and cons of the various units. Thus, the dentist would adopt the more rigorous central route to investigate a major purchase. This approach to decision making mirrors the legal principle that the level of proof required increases with the seriousness of the case (eg, civil cases are decided on the balance of probabilities, and criminal cases adopt a beyond-reasonable-doubt standard). Therefore, expensive purchases warrant more consideration than inexpensive ones.

This example also illustrates that the choice of route depends on involvement. When a decision affects the decision maker personally, the tendency is to use the central route to develop counterarguments and seek out information. When the decision does not have a personal effect, the decision maker tends to use the peripheral route. In Canada, there has been a recent tendency to markedly raise tuition for professional programs, such as law, medicine, and dentistry. The administrators' problem is that they at least must appear to have
consulted the students about the matter. Students who will have to pay the increased fees will question whether the increase is justified; they will use the central route in evaluating the arguments for increases and will develop counterarguments. Administrators anticipating this problem use the following ruse: They consult the graduating class, that is, students not subject to the increased fees. These students have only marginal interest, tend to adopt the peripheral route in evaluating the proposal, and are generally easier to convince than those subject to the fee increase. I do not suggest that this is appropriate administrative behavior, but it is effective. In an article in the National Post, this scenario was described as "about as bad an example of operating in bad faith as you'll see on campus."31

The principal principle

The selection of route for processing persuasive messages is an example of an overarching theme I call the principal principle: Never underestimate people's ability to identify their own self-interest and act accordingly. This principle is not a new insight. Aristotle taught the following:

In deliberations about the future the means are exhortation and dissuasion and the topics that figure most prominently are the worthy and the worthless or the advantageous and the injurious . . . when we are trying to persuade a person to do something we try to show that the recommended course of action is either a good in itself . . . or something that will benefit the person.3

The concept of benefiting the person should be interpreted very broadly. If, as noted earlier, recognition is the coin of science, then the common wisdom is to cite people who might be called upon to review your manuscript or grant. On the barnyard level of scientific interaction, another piece of advice to young investigators is never to submit an application to a panel that has a direct competitor among its members. Both rules simply state the obvious truth that a proposal, no matter how persuasively written, will have difficulty convincing a hostile audience. A highbrow illustration of the principal principle can be seen in Ceccarelli's work Shaping Science with Rhetoric.32 In brief, Ceccarelli examined three books by prominent and distinguished authors (Dobzhansky's Genetics and the Origin of Species, Schrödinger's What is Life?, and Wilson's Consilience) who each addressed
two groups with conflicting interests. The successful books (ie, influential in their scientific fields) promoted a conceptual shift that created possibilities of collaboration and furtherance of each group’s interest. Moreover, the successful books used polysemy so that the two communities developed contradictory readings of the texts. Wilson’s book failed because he favored one side’s view over the other. At various times in my career, I have had the opportunity to serve on panels that were preparing advice for guidelines on products or tests. Such panels had representatives from industry, academe, and regulatory bodies. Name tags were unnecessary to identify affiliations; the participants’ arguments, analyzed through the lens of their self-interest, located their positions. Manufacturers knew which guidelines would make their product appear best. Factions within the manufacturers knew which guidelines would best promote their own departments (such as microbiology). Independent testing laboratories knew which guidelines would bring them more business. Academics knew which recommendations would likely bring them grants. Each group furthered its own interest. Aristotle likely would have been disappointed but not surprised.

Theory of cognitive dissonance

Cognitive dissonance occurs when a person simultaneously holds two inconsistent beliefs or opinions. Such a state of inconsistency is so uncomfortable that people strive to reduce the conflict in the easiest way possible. Cognitive dissonance has been extensively used in propaganda. The main application is known as the rationalization trap.33 First, the propagandist arouses feelings of dissonance by threatening the audience's self-esteem. Then the propagandist provides a solution, a way of reducing the dissonance by complying with a request. The rationalization trap is common in sales pitches. At a continuing education course, I heard the following argument: Why should hardworking dentists have to reduce their fees and their income by accepting what the insurance company offers for various procedures? In doing so, the presenter argued, dentists not only reduce their income but rupture the sacred dentist-patient bond by involving a third party. The presenter suggested that the solution is to transfer the responsibility of collecting insurance payments (and accepting the shortfall between reimbursement levels and fees) to the patient. This approach, the presenter argued,
maintains income, facilitates direct patient-dentist discussion, and makes patients value their dentists’ work more because they know its cost. Moreover, this approach could be easily implemented by hiring the presenter’s management team to train the dental office staff.

The law of cognitive response

The information-processing model of understanding persuasion views the audience as rational beings who sequentially process information. To persuade, this model argues, the persuader first has to attract people's attention, then have them understand the message, learn the arguments underlying it, and come to accept the arguments as true. Persuaders then should teach the audience the arguments so that they will come easily to mind. The cognitive response approach to persuasion was developed in the late 1960s, in part because the information-processing model did not accurately predict people's response to the mass media. To summarize, investigators found that the mass media did not necessarily tell people what to think, but the media did tell them what to think about and how to do it. Moreover, it was found that a message could be persuasive even if it failed to accomplish some of the information-processing stages. Even more damaging to the concept of rational persuasion was that some messages could be persuasive even if their arguments were not understood.34

The law of cognitive response holds that the successful persuasion tactic is one that (1) directs and channels thoughts so that the audience thinks in a manner agreeable to the communicator's point of view; (2) disrupts any negative thoughts; and (3) promotes positive thoughts about the proposed course of action. The law of cognitive response appears to be the principle used, almost instinctively, by authors in the introduction of a paper. The author selects which references will be used to frame the background of the research and emphasizes the particular aspect of the problem that will be studied. I use the term problem deliberately; the typical overall approach used in the introduction (at least in shorter scientific papers) is the problem-solution framework. To build interest, the introduction promotes the importance of the problem, emphasizes the novelty of the study (to disrupt the negative thought "Hasn't this been done before?"), and emphasizes the merits of the particular approach that has been adopted in the paper.

In fact, the law of cognitive response underlies the general principle of setting an agenda to implement the wishes of those in power. Years ago, the Medical Research Council (MRC) of Canada attempted to alter the composition of its grants committees to reflect the changing pattern of research. The attempt failed because in the consultation process, which mainly involved scientists who already had grants, the predominant emotion was fear; spreading the range of topics being studied posed the risk of reducing the funding for those who already had grants. If the funding pie were sliced into more pieces, the current grantees worried that most would get less and some might get nothing at all. A second attempt at reorganization, however, was successful and resulted in the transformation of the MRC of Canada to the Canadian Institutes of Health Research (CIHR). The law of cognitive response was an unacknowledged but important participant in this transformation. First, the MRC hired a professional firm to manage the process. The background and agenda that they produced emphasized the positive possibility that the pie of research funding was going to get bigger and could support additional slices. The agenda was so arranged that discussion of negative aspects of forming CIHR never seemed to occur, but there was constant repetition of the benefits. The emotion aroused by the proposed transition was greed, and the proposal smoothly sailed to adoption.

Setting a positive framework for discussion also occurs in grant proposals and responses to reviewers. I was once involved in the assessment of proposals for the establishment of the Materials Research Science and Engineering Centers for the United States National Science Foundation. These centers are large and multidisciplinary, and the funding is considerable. The center directors and their staff are often eminent in their fields; indeed, some research teams numbered Nobel Prize winners among their members. Befitting such important and costly decisions, the reviewing process is intense and multistep, so that the final panel sees the reviews of the proposed centers from many scientists. Such intensive scrutiny, of course, produces many negative comments. Prior to the final assessment, the applicants can respond to the reviewers. From a rhetorical point of view, the interesting aspect was that each of these highly skilled grant-getters adopted the same approach. Rather than wading in and addressing the criticisms one by one, they first set the stage in a positive manner by noting the many positive comments made by the reviewers. Also true to the spirit of the law of cognitive response, the negative
comments were downplayed and the benefits of funding the centers emphasized.

The soapbox effect and digressions

Two obvious and annoying ways of framing the climate of discussion are the soapbox effect and digression. In the soapbox effect, one discussant repeatedly makes the same points and so is said to be "on the soapbox," a reference to an era when politicians made improvised platforms of soapboxes from which to spew forth their views. The three key elements of the soapbox effect are (1) to prevent other discussants from voicing their views by hogging the discussion time; (2) to exploit audiences' tendency to confuse message length with message strength; and (3) to try to have the last word in a discussion, because audiences tend to think that the last words constitute the conclusion of a discussion. Digression is a related technique in which a discussant speaks on a matter unrelated or only marginally related to the topic of the discussion. As in the soapbox effect, the goals of digression are (1) to prevent opponents from expressing their views, simply because the digresser holds the floor; (2) to give the digresser a veneer of expertise, because he can digress on topics about which he actually knows something (or at least thinks he knows something); and (3) to respond to a valid criticism by attempting to distract the audience. The latter stratagem is also known as the red herring method, which refers to the technique of drawing a red herring across the track of a hunted fox, thereby causing hounds to lose the scent.

The law of cognitive response as applied to clinical and academic dentistry

Although not emphasized in this text, persuasion plays a key role in clinical dentistry, because patients sometimes must be persuaded of a treatment plan's merits. In this context, the key principle recommended in such texts as Tough Questions, Great Answers35 is similar to the advice in a popular song circa 1944 with lyrics by Johnny Mercer:

You've got to accentuate the positive,
Eliminate the negative,
Latch on to the affirmative,
And don't mess with Mr In-between.

Accentuating the positive is a prime principle of spin-doctoring. In clinical persuasion, this involves explaining not only what you do but why, with emphasis on the benefits to the listener (thus invoking the
principal principle and pathos), as well as the integrity of the practice (ethos). Another rule is to keep the message brief but to develop positive points and themes (logos element). Another key point of the law of cognitive response is never to repeat a negative allegation. For example, it once was alleged that the rapid rise in tuition costs at our faculty would reduce the quality of the dental students. The administrators responded that we had over 200 applicants that met our requirements—an answer that does not really address the concern, because the faculty wants not 200 qualified applicants, but a top 40 of the highest quality for the program. The response relates to the question, because, obviously, the faculty needs at least 40 warm bodies to fill the available spaces, but it does not address whether the quality of the students who enter the program is as good as it would be if a lower fee structure were in place. Another spin-doctoring approach is to turn a negative point into a positive point. For example, I once knew of an endodontics department that had problems arranging sufficient coverage in a student dental clinic; the students complained. The spin-doctored response was: “We are not satisfied if even one appointment is missed, and we are working hard to get more endodontic instructors.”

Audience as "cognitive misers"

One rationale for the success of strategies such as the peripheral route to persuasion and the law of cognitive response is that people are "cognitive misers" and try to minimize the amount of energy used to solve problems. This concept is not new; it harkens back to the Zipf principle of least effort (see chapter 1) and to the view of Sir Joshua Reynolds, the 18th-century portrait painter, who is cited as saying, "There is no expedient to which a man will not resort to avoid the real labor of thinking."36 One approach people use to simplify the task of making judgments is the use of heuristics, simple rules to solve complex problems. Using a simple rule rather than a complex calculation and assessment of factors reduces the cognitive load. Faced with a purchasing decision, one might use the rule "you only get what you pay for" as a rationale for choosing the more expensive model. Accepting a simple case study with n = 1 would conform with the heuristic of believing in small numbers and ignoring the difficulties of statistics. Judgment under uncertainty and the use of heuristics is covered in more detail in chapter 22.

The snow job is another example of using the cognitive miserliness of an audience for persuasive purposes. In this technique, more information is provided than realistically can be processed by the audience, which is evident in dental product advertisements packed with detail or grant applications supported by multiple appendices. In fact, to thwart this tactic, some granting agencies have reduced the amount of material that can be submitted. Message-dense advertisements can present information faster than people can process it; advertisers present in 30 seconds a message that would take 35 seconds at a normal speaking rate.34 One defense against the snow job tactic is to develop the skill of constructing counterarguments quickly.

The Cialdini principles of influence

Cialdini has reviewed the seven basic principles of influence in his best-selling book Influence: Science and Practice,37 as well as in a review article in Scientific American.38 The principles have wide application, but here I will concentrate on how they apply to persuasion in science.

Contrast

Contrast means to make differences apparent. For example, prior to the popularization of tooth bleaching, some dentists told their patients that a sun tan was the solution to making teeth look whiter. In scientific literature, however, as the size of an effect is a factor in determining the impact of a study, authors develop the "art" of making favorable comparisons. The chapter on presenting results (chapter 14) details the common practice of making the differences between groups treated differently appear large. Chapter 20, on experiment design, advocates developing conditions that maximize the difference between groups.

Reciprocation

Reciprocation is the norm that obliges individuals to repay in kind what they receive. This tendency is so strong that it can even overcome dislike.27 Often used in commerce (eg, free samples at food stores or "gifts" such as address labels from charities), it can be very effective; when a charity included a gift in its solicitation letter, the response rate almost doubled, from 18% to 35%.37 Another example: I believe that professors as a group tend to be a frugal lot, and I have observed many of my colleagues lining up at Costco for free food samples. Always anxious to investigate phenomena with appropriate methodology for my available resources, I adopted the participatory action method, lined up behind them, and submitted myself to such dubious delights as mini pizza bagels. I observed that even some of my most haughty colleagues were ingratiatingly polite to the servers who controlled the size of the sample. But Costco is in a position to use a quantitative approach and has acquired data showing an average 600% increase in frozen pizza sales after product sampling.39

Reciprocation also surfaces in concessions in negotiation. If a referee makes a comment on a manuscript, the authors typically modify their manuscript to reflect that concern. A troublesome aspect of reciprocation is illustrated by the finding that only 37% of authors who published conclusions that were critical of the safety of calcium channel blockers had received prior drug company support; however, 100% of authors supporting the drugs' safety had received something (eg, anything from free trips to grants and employment).40 It is hard to bite the hand that feeds you. Reciprocation also occurs in science when scientific groups routinely cite each other; the exchange of these coins of recognition can lead to mutual liking, which calls up another factor in producing influence.

Consistency

Consistency is the tendency of people to behave in accordance with a previous commitment. A car salesman will ask the buyer for a deposit to supposedly show the sales manager that the buyer is serious, but the real reason is to get the buyer to make a commitment to purchase the car, which the buyer will find psychologically difficult to reverse. In science, consistency is often invoked as a reason for policy. At a grants panel, a reviewer discussing a recommendation of a colleague might ask why funding for a technician for Dr X is not being recommended, when a technician was funded for Dr Y, who proposed similar work. This commitment to consistency is one of the ground rules of logic; reviewers have to treat like cases in the same way or be able to explain why not. Similarly, authors often are at pains to show that their results agree in some aspects with those of other workers.

Social validation

Social validation, also known as the principle of social proof, refers to the finding that we view a behavior as correct in a given situation to the extent that we see others performing it.28 This occurs in the discussions in grants committees when panelists jump on a bandwagon. One panelist might enunciate a principle, such as, "We have to give the young scientists a chance." Some will agree because they believe it to be true, but others will agree because it seems the right thing to do in this particular committee meeting. Certainly, bandwagons feature in committee meetings. For a number of years, my own grant proposals to the CIHR dental sciences committee, though funded, were criticized for not being "molecular enough"; there were some on the committee who thought that it was more important to go "molecular" than to answer what were, in my own view, the most important questions. My response was twofold: I incorporated some experiments employing molecular biology (an example of reciprocation through concession), and, secondly, I tried to emphasize why my other suggested approaches were appropriate.

Sometimes bandwagons run off the road. Janis41 describes groupthink as the phenomenon of members of highly cohesive groups who, in striving for unanimity, override the realistic need to appraise alternative courses of action. Some years ago, I served for several years on an MRC committee that gave fellowships to professionals who wanted to develop their research skills. When I first saw the information for each candidate that the committee was given to review, I believed that there was no way the committee could do the task well. The committee was given such things as undergraduate and professional marks, which—as they came from different universities with different marking systems and standards—were difficult to interpret; biased reference letters from referees the candidate had chosen; and a letter of support from the laboratory to which the candidate wished to go (the laboratory would get a "free pair of hands" if the application were approved). Nevertheless, the committee worked hard, and the senior members, who were highly productive and impactful scientists, showed the junior ones, such as myself, how to complete the task. We learned from our teachers and achieved near consensus in our scores. We thought we had done well. Years later, however, a study found that the program was not as successful as hoped in its objective, namely, that of identifying and supporting people who developed research careers. We were victims of groupthink, as identified by many of its salient characteristics: We shared stereotypes of what a successful candidate's record should look like; we believed in our inherent morality (as hardworking volunteers); we exhibited self-censorship in the group by not questioning the types of information we were given, though we all had private doubts; and we ended up with an illusion of unanimity.

Liking

Liking is important, because—as Cialdini posits—we most prefer to say yes to the requests of people we know and like. We tend to like people who are physically attractive and similar to us in their opinions, personality traits, background, and lifestyle. We also like people who like us; compliments are always appreciated, and flattery works even when we view it as untrue. Serving on grants committees in Canada is an unpaid and thankless task, but someone has to do it, and administrators must devise novel ways to convince people to do their duty. I once received a letter from the Deputy Director, Programs Branch, CIHR that stated:

It is with great pleasure that I invite you to become Scientific Officer for the Multi-User Equipment and Maintenance Committee. Your ongoing achievements, the benefits of your expertise, and your valuable contributions to previous committees have all been recognized and commended by your peers.

The flattery worked; I served on that committee and others. Liking also is part of the formation of "old boys' networks" (such networks can also involve "old girls"). For the CIHR grants committee, it is customary that the committee goes out one evening to a splash-up dinner; everyone on the committee has worked hard, everyone is impressed with the scientific acumen of their committee colleagues, in vino veritas—all these factors contribute to liking. Everyone likes everyone else on the CIHR committee, in striking contrast to the usual academic milieu, where a few, at least, will be enemies or hostile competitors for scarce resources. Sometimes, a former committee member's application comes up for review. As will be seen in chapter 22, on judgment, it is difficult to revise a model. In this case, once a panelist thinks of a colleague as an able, hardworking scientist, it is difficult to envision a serious scientific blunder. Indeed, when a well-established colleague makes a mistake in a proposal that
would prove damaging to a newcomer, it tends to be interpreted simply as a minor slip.

Authority

Authority is a form of inductive argument, and its use in that context will be covered in chapter 7. Authority ranks rather low in the hierarchy of evidence in evidence-based dentistry or medicine, but it carries considerable clout in everyday life. I find encouraging the studies that found that the title of "Professor" makes strangers feel more accommodating toward a professor and causes them to perceive professors as taller. One study found that the same man is perceived as 2½ inches taller when described as "Professor" than when described as "a student."42 Juries, I have heard, believe male professors above all other types of witnesses. On such evidence, I am predisposed to believe that some principles of influence lead to real social good (at least for vertically challenged male professors).

The use of authority is common in science, because statements must be backed up by evidence, which is often given by a citation. Behind the citation are the data of the study, the authors, and the journal that published the study as well as, implicitly, the referees who reviewed the study. Sometimes, however, data are not available, and authority must serve as a substitute. This text, like many others, is sprinkled with statements either based on my own experience or supported by the opinions of prestigious authorities, such as the Nobel Prize winner Peter Medawar. Typically, authorities' statements are used when judgment is required and higher forms of evidence are not available on the topic. As Medawar25 stated in the preface of Advice to a Young Scientist, "These are my opinions and this is me giving them … my judgments are not validated by systematic sociological research and are not hypotheses that have stood up to repeated critical assaults." Another use of authority occurs in grant applications. Modern work is often multidisciplinary, and a typical applicant may wish to use techniques with which he or she lacks experience. To fill this gap, an investigator will recruit a collaborator who has an established publication record in the particular area.

Scarcity

Scarcity surfaces because resources are always limited in scientific research, as are awards. Thus, those who garner the resources or awards garner recognition, and, of course, recognition is the coin of science. Scarcity motivates authors to publish in high-impact journals. Those who publish in such journals must beat back stiff competition, and their work is viewed as more interesting or important than that of those who publish in lower-impact journals.

Summary

The preceding examples demonstrate that rhetoric plays a large role in scientific discourse. Scientists must persuade to publish, to get grants, and to obtain recognition. They persuade and are persuaded by rhetorical techniques from everyday discourse. Because they fall into some of the same traps as the naive and unwary public, it is necessary for scientists to become familiar with techniques to separate the grains of truth from the rhetorical chaff.

References

1. Halloran SM. Technical writing and the rhetoric of science. J Tech Writing Commun 1978;8:77–88.
2. Habinek T. Ancient Rhetoric and Oratory. Oxford: Blackwell, 2004.
3. Corbett EPJ, Connors RJ. Classical Rhetoric for the Modern Student, ed 4. New York: Oxford University, 1999.
4. Habinek T. The craft of rhetoric. In: Habinek T (ed). Ancient Rhetoric and Oratory. Oxford: Blackwell, 2004:38–59.
5. Habinek T. Rhetoric and the state. In: Habinek T (ed). Ancient Rhetoric and Oratory. Oxford: Blackwell, 2004:1–15.
6. Lanham RA. A Handlist of Rhetorical Terms. Berkeley: University of California, 1969.
7. Gross AG. The Rhetoric of Science. Cambridge: Harvard University, 1990.
8. Horton R. The rhetoric of research. BMJ 1995;310:985–987.
9. Newman S. Aristotelian rhetorical theory as a framework for teaching scientific and technical communication. J Tech Writing Commun 1999;29:325–334.
10. Greenhalgh T. Commentary: Scientific heads are not turned by rhetoric. BMJ 1995;310:987–988.
11. Boswell J, Starer D. Five Rings, Six Crises, Seven Dwarfs, and 38 Ways to Win an Argument: Numerical Lists You Never Knew or Once Knew and Probably Forgot. New York: Penguin, 1991.
12. Orsinger RR. The Role of Reasoning in Constructing a Persuasive Argument. http://www.orsinger.com/PDFFiles/constructing-a-persuasive-argument.pdf. Accessed 16 Feb 2018.
13. Waddell C. The role of pathos in the decision-making process: A study of the rhetoric of science policy. In: Harris RA (ed). Landmark Essays on Rhetoric of Science: Case Studies. Mahwah, NJ: Hermagoras, 1997:127–150.
14. Paul D. In citing chaos: A study of the rhetorical use of citations. J Business Tech Commun 2000;14:185–222.
15. Reif-Lehrer L. Grant Application Writer's Workbook. Boston: Jones & Bartlett, 2002.
16. Halloran SM. The birth of molecular biology: An essay on the rhetorical criticism of scientific discourse. In: Harris RA (ed). Landmark Essays on Rhetoric of Science: Case Studies. Mahwah, NJ: Hermagoras, 1997:39–52.
17. Pratkanis A, Aronson E. The credible communicator. In: Pratkanis A, Aronson E (eds). The Age of Propaganda: The Everyday Use and Abuse of Persuasion. New York: Freeman, 2000:121–127.
18. Haskins CH. The Rise of Universities. Ithaca: Cornell University Press, 1923:77–78.
19. Latour B, Woolgar S. Laboratory Life: The Construction of Scientific Facts. Princeton: Princeton University, 1986.
20. Sollaci LB, Pereira MG. The introduction, methods, results, and discussion (IMRaD) structure: A 50-year survey. J Med Libr Assoc 2004;92:364–367.
21. Spicer K. Think on Your Feet: How to Organize Ideas to Persuade Any Audience. Toronto: Doubleday, 1986.
22. Anholt RRH. Dazzle 'Em with Style: The Art of Oral Scientific Presentation. New York: Freeman, 1994.
23. Kurland DJ. I Know What It Says—What Does It Mean? Belmont: Wadsworth, 1995.
24. Gopen GD, Swan JA. The science of scientific writing. Am Sci 1990;78:550–558.
25. Medawar PB. Advice to a Young Scientist. New York: Basic Books, 1979.
26. Abelson RP. Statistics as Principled Argument. Hillsdale: Erlbaum Associates, 1995.
27. Tufte ER. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 1983.
28. Cialdini RB. Influence: Science and Practice, ed 4. Boston: Allyn & Bacon, 2001.
29. Simons HW. Persuasion in Society. Thousand Oaks, CA: Sage, 2001.

30. Kahneman D. Thinking, Fast and Slow. Toronto: Anchor Canada, 2013.
31. Perreaux L. Law students vote to double tuition for incoming class. National Post. 6 Feb 2003.
32. Ceccarelli L. Shaping Science with Rhetoric: The Cases of Dobzhansky, Schrödinger, and Wilson. Chicago: University of Chicago, 2001.
33. Pratkanis A, Aronson E. The rationalizing animal. In: Pratkanis A, Aronson E (eds). Age of Propaganda: The Everyday Use and Abuse of Persuasion. New York: Freeman, 2000:40–47.
34. Pratkanis A, Aronson E. Mindless propaganda, thoughtful persuasion. In: Pratkanis A, Aronson E (eds). Age of Propaganda: The Everyday Use and Abuse of Persuasion. New York: Freeman, 2000:33–39.
35. Wright R. Tough Questions, Great Answers: Responding to Patient Concerns about Today's Dentistry. Chicago: Quintessence, 1997.
36. The quintessential innovator. Time 1979;114(17).
37. Cialdini RB. Reciprocation: The old give and take. In: Cialdini RB (ed). Influence: Science and Practice, ed 4. Boston: Allyn & Bacon, 2001:19–51.
38. Cialdini RB. The science of persuasion. Sci Am 2001;284:76–81.
39. Pinsker J. The psychology behind Costco's free samples. The Atlantic. https://www.theatlantic.com/business/archive/2014/10/the-psychology-behind-costcos-free-samples/380969/. Accessed 1 March 2019.
40. Yaphe J, Edman R, Knishkowy B, Herman J. The association between funding by commercial interests and study outcome in randomized controlled drug trials. Fam Pract 2001;18:565–568.
41. Janis I. Victims of Groupthink. Boston: Houghton Mifflin, 1972.
42. Wilson PR. The perceptual distortion of height as a function of ascribed academic status. J Soc Psychol 1968;74:97–102.


5 Searching the Dental Literature

Helen L. Brown, MLIS, MAS, MA
Donald Maxwell Brunette, PhD

"'Data! Data! Data!' he cried impatiently. 'I can't make bricks without clay.'" —Sir Arthur Conan Doyle1

To Dave, Ada, and Sybil, for their love, encouragement, and silliness. —Helen L. Brown

Because we look up information of some sort every day, finding information can seem easy. However, finding the right evidence to guide clinical decision making requires familiarity with the resources, appropriate research and search strategies, and careful evaluation of the evidence. The sheer volume of scientific research produced every day and the constant changes to journals and other resources make it challenging to find research and stay up-to-date. The following chapter discusses tools and strategies that can help you find the right information quickly, build effective searches, evaluate sources to determine reliability, understand scholarly publishing trends, make the best use of the available tools, and stay current and well informed.

When to Seek Help

Librarians are experts in finding and managing information, and as is the case with most skills, it can be most effective to learn directly from an experienced teacher who knows all of the tricks and potential pitfalls. For students and beginning professionals (or those who simply want a review), library workshops, one-on-one consultations, guides, tip sheets, and video tutorials can provide tested research strategies and help develop literature search skills. Common topics include effective searching, selecting the right resources, evaluating information, managing references, and leveraging available resources such as point-of-care and current awareness tools. Many guides and videos created by dentistry or health librarians are openly available online. Even experienced researchers may need to collaborate with a librarian when conducting a comprehensive search or managing scholarly communications issues. For example, it is highly recommended to include a librarian as a member of the research team
when conducting knowledge synthesis studies such as systematic reviews. Librarians can also help answer questions related to publishing and copyright, meeting open access requirements from funding agencies, and managing research data.

Evidence-based dentistry

Evidence-based dentistry (EBD) is based on a wider shift in health education and practice. It is a practice-based approach that focuses on integrating evidence and patient preferences into clinical decision making (Fig 5-1). The American Dental Association defines EBD as "an approach to oral healthcare that requires the judicious integration of systematic assessments of clinically relevant scientific evidence, relating to the patient's oral and medical condition and history, with the dentist's clinical expertise and the patient's treatment needs and preferences."2

Fig 5-1 Evidence-based dentistry. EBD lies at the intersection of best research evidence, clinical expertise, and patient values.

One difference between EBD and other information-gathering practices is that EBD aims to decrease bias in the selection of evidence through advanced search techniques and critical appraisal of existing research. The following steps, the five As, are the main methods for quickly identifying, appraising, and implementing evidence in clinical practice; they are discussed further in Evidence-Based Dentistry: Managing Information for Better Practice.3

1. Ask the question. Formulate an answerable question using a model such as PICO (described later in this chapter).
2. Acquire the evidence. Use your question to create an appropriately thorough search using resources that will best answer your question.
3. Appraise the evidence for validity and clinical relevance.
4. Apply the evidence to practice.
5. Assess the results and modify practice as needed.

As with any method, EBD can seem challenging and time consuming at first but becomes easier with practice. The goal is to create an unbiased literature search that helps you access the best research evidence, rather than limiting yourself to only the research that you expect to find. Below are tips to help you. Recording information about your searches, evidence assessments, and clinical decisions will ensure that you have a record that you can refer back to and will prevent unnecessary duplication of work. Moreover, it provides a road map and record of your decision
making that can aid coworkers and help you to stay up-to-date, accountable, evidence based, and unbiased. One method is to document what you find in a file or binder. Another might be to organize and save relevant research using a citation manager such as EndNote, RefWorks, Zotero, etc. In addition, you can use the saving and alerts features within article databases to save searches and receive automatic email updates of new research published on a given topic. These strategies will be discussed in more detail later. Resources specifically on EBD may also be helpful in providing more detail about each of the above steps.3–7
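To make the Ask and Acquire steps concrete, the sketch below is a minimal illustration (not from the EBD texts cited above) of assembling a Boolean query from PICO concept blocks, with synonyms ORed within each concept and the concepts ANDed together, and submitting it to PubMed's public E-utilities search endpoint. The PICO terms and the 7-day window are illustrative assumptions; re-running such a script periodically is a do-it-yourself version of the database alert features just described.

```python
import requests  # third-party HTTP library (pip install requests)

# Illustrative PICO concept blocks: synonyms are ORed within a concept,
# and the concepts are ANDed together.
pico = {
    "population": ["dental implants", "oral implantology"],
    "intervention": ["antibiotic prophylaxis", "preoperative antibiotics"],
    "outcome": ["implant failure", "treatment failure"],
}
query = " AND ".join("(" + " OR ".join(terms) + ")" for terms in pico.values())

# PubMed E-utilities esearch: reldate=7 with datetype=edat restricts
# results to records from the last 7 days, a crude do-it-yourself alert.
resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "pubmed", "term": query, "retmode": "json",
            "retmax": 20, "datetype": "edat", "reldate": 7},
    timeout=30,
)
result = resp.json()["esearchresult"]
print(query)
print("Records this week:", result["count"])
print("PMIDs:", result["idlist"])
```

A real search for a systematic review would of course add controlled vocabulary (MeSH terms), field tags, and documented inclusion criteria; the point here is only the AND/OR structure that keeps a PICO search both unbiased and focused.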

Information Sources

A common tool for assessing the quality of evidence is the evidence pyramid. As you can see in Fig 5-2, expert opinion and background or factual information that can help you to orient yourself to a new topic are located at the bottom of the pyramid, whereas filtered, synthesized information sources, such as systematic reviews and meta-analyses, are located at the top.

Fig 5-2 Evidence pyramid.8 From top to bottom: meta-analyses; systematic reviews; critically appraised topics; critically appraised individual articles; randomized controlled trials (RCTs); cohort studies; case-control studies; case series/reports; and background information/expert opinion. The upper levels are filtered information and the lower levels unfiltered information; the original figure also notes that these sources are searchable via individual databases.

Unfiltered information sources are studies reporting directly on research data and observations, while filtered information sources combine data from multiple sources and include a quality assessment to ensure that the included studies meet research methods standards that limit bias and inaccuracy. Because filtered information sources include evidence from multiple studies, they report on data from larger populations and tend to decrease biases that may be present in any one study. The pyramid of evidence is a good starting place for evaluating research evidence and considering how it might factor into clinical decisions. That being said, it
is very important to critically appraise each source of evidence. For example, meta-analyses and clinical practice guidelines (CPGs) are generally excellent sources of filtered, synthesized research evidence, but it is possible for a researcher to publish a study as a meta-analysis without actually following the proper methodology. Similarly, CPGs are ideally built on systematic reviews or other rigorous study designs particular to that area of research (eg, a Delphi study), but there are also CPGs that are driven only by the opinions and experience of the authors. It is also important to remember that not all research questions are answerable by the same types of studies (eg, randomized controlled trials); therefore, think about what kinds of research are possible for a particular research question.

Systematic reviews and knowledge synthesis reviews

Systematic reviews and other types of knowledge synthesis reviews, such as meta-analyses, scoping reviews, and rapid reviews, are original research studies that synthesize all of the relevant evidence on a given research question to find an answer that will help direct future research or practice. The aim of a systematic review is to conduct a thorough and unbiased search of all published and unpublished research, use a predetermined screening process to select studies, conduct a thorough quality analysis, extract and synthesize data, and pool the results. The results are then considered to be the highest form of evidence since they synthesize all available high-quality studies. These studies follow very rigorous methodologies designed to ensure that the resulting review is unbiased, thorough, and reproducible. There are many variations of knowledge synthesis studies for different types of evidence and questions. There has been a significant increase in the number of knowledge synthesis studies produced in the last 10 years, and they are an excellent source of information for dentists and researchers who need a comprehensive analysis of the evidence on a topic.

These knowledge synthesis studies are quite different from traditional narrative or literature reviews, which provide an overview of a topic and often summarize recent developments, but do so without following any particular methodology (Box 5-1). Narrative reviews have an inherent risk of bias, as there are no standards or quality controls for how authors find or select the studies they include. They also have a tendency to reflect opinion or biases.

Box 5-1 | Example of systematic versus narrative reviews

The following example illustrates the difference between a narrative and systematic review. In 2008, Resnik and Misch published a narrative review titled "Prophylactic antibiotic regimens in oral implantology: Rationale and protocol," which broadly reviewed the indication of antibiotic prophylaxis prior to implant surgery.9 They only cited the clinical outcomes from the Implant Clinical Research Group, which reported that antibiotic prophylaxis prior to surgery decreases implant failures from 10% to 4.5%.10,11 Oddly, this narrative review failed to mention the outcomes of eleven other articles relevant to the issue of antibiotic prophylaxis in implant dentistry that were identified and critically appraised in a systematic review authored by Schwartz and Larson a year before.12 This systematic review concluded that there was little evidence for the benefit of antibiotic prophylaxis when the public health risk of over-prescribing antibiotics is considered. Resnik and Misch may have had legitimate reasons for not discussing other potentially relevant articles in their narrative review, but without a transparent and standard methodology for selecting the included articles (ie, inclusion/exclusion criteria), the validity of their review's conclusions is suspect. In 2013, Esposito, Grusovin, and Worthington presented the results of an updated systematic review/meta-analysis of only randomized controlled trials comparing the outcomes of patients undergoing implant surgery with and without antibiotic prophylaxis.13 Although they found a statistically significant reduction in implant failures in the antibiotic prophylaxis group, the difference between the two groups was less than that reported by Resnik and Misch. This example shows that conclusions can differ greatly depending on the methods used to collect and analyze data, so it is essential to be aware of these differences when evaluating research. Example provided by Dr Ben Balevi.

When reading a systematic review or other knowledge synthesis review, it is important to assess whether the authors followed the appropriate methodologies. A cornerstone of knowledge synthesis reviews is that the review should explicitly report all of the methods, including a full copy of the search strategy. As a result, readers can judge the depth of the review by considering how exhaustively the existing research on the topic was covered, the appropriateness and inclusivity of the selection criteria, and how the information was extracted and analyzed. This level of disclosure
enables readers to decide whether the process was sufficiently rigorous and therefore how much weight to give the results. To aid in this process, there are a number of appraisal tools and standards that can help both authors and readers to understand and appraise the methods used in a given review. Perhaps the best-known guidelines and appraisal tools are PRISMA, Cochrane, and AMSTAR. These and other highly used tools are listed in Box 5-2. AMSTAR 2 was published in 2017 and provides a useful set of criteria to consider when evaluating systematic reviews of randomized trials. The accompanying guidance document is particularly helpful for learning how to apply the tool to identify high-quality systematic reviews. The PRISMA checklist was developed to aid authors of reviews but can also be helpful for readers who need to quickly determine whether a systematic review meets the basic methodologic criteria. For appraising other types of reviews, see some of the other tools listed in Box 5-2.

Box 5-2 | Guidelines and appraisal tools

Guidelines for conducting knowledge synthesis reviews
• PRISMA Preferred Reporting Items for Systematic Reviews and Meta-Analyses: prisma-statement.org
• Cochrane Handbook for Systematic Reviews of Interventions: training.cochrane.org/handbook
• Cochrane MECIR standards: methods.cochrane.org/mecir
• Centre for Reviews and Dissemination Guidance for Undertaking Reviews in Health Care: www.york.ac.uk/crd/guidance
• Campbell Collaboration: www.campbellcollaboration.org
• Joanna Briggs Institute Reviewer's Manual: wiki.joannabriggs.org/display/MANUAL

Critical appraisal tools
• AMSTAR 2 (A MeaSurement Tool to Assess systematic Reviews): amstar.ca
• Cochrane Risk of Bias Tool: methods.cochrane.org/bias/assessing-risk-bias-included-studies
• SIGN Critical Appraisal: www.sign.ac.uk/checklists-and-notes.html
• Centre for Evidence-Based Medicine Critical Appraisal Tool: www.cebm.net/2014/06/critical-appraisal
• Joanna Briggs Institute Appraisal Checklists: joannabriggs.org/critical_appraisal_tools
• Evidence-Based Medicine Toolbox: Systematic review (of therapy) critical appraisal worksheet: ebm-tools.knowledgetranslation.net/worksheet

Systematic reviews and meta-analyses generally aim to change practice through an original assessment and pooling of data. They are most effective when they pull together disparate, smaller studies that, when combined, provide more substantial evidence of the effectiveness of a particular intervention that can then be used to change clinical practice or settle longstanding disputes over treatment options. The systematic review featured in the Cochrane Collaboration's logo is one example of a systematic review that significantly changed health care practice. The Cochrane Collaboration (www.cochrane.org) is an international nonprofit organization with review groups and members from more than 130 countries that facilitate the process of conducting systematic reviews. Following a structured methodology, the reviews examine and provide analyses of research results on specific topics in a variety of health areas and are considered the gold standard in systematic reviews of interventions. Cochrane's logo shows a forest plot, the graphical representation of a meta-analysis, summarizing evidence that the use of corticosteroids prior to preterm birth decreases infant mortality. Prior to the publication of this systematic review, there was no agreement on or standardized use of this treatment. Following the results of this study, use of corticosteroids for pregnant women at risk of preterm labor became standard practice and, as Cochrane points out, has "probably saved thousands of premature babies."14

So, how do you find systematic reviews? Most systematic reviews are searchable through article databases. Another important source of high-quality systematic reviews is the Cochrane Database of Systematic Reviews (CDSR). Many university libraries and other organizations provide access to CDSR. The Cochrane Library (www.cochranelibrary.com/) allows users to search for abstracts and open access full-text Cochrane reviews. You can also browse the Cochrane Library by topic (eg, "Dentistry and oral health") or by review group (eg, the Oral Health review group).
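The pooling step that produces the summary estimate in a forest plot such as the one in Cochrane's logo is, at its core, an inverse-variance weighted average. The sketch below uses invented study results and shows only the simple fixed-effect model; a real meta-analysis must also assess heterogeneity and will often use a random-effects model instead.

```python
import math

# Invented per-study effect estimates (log odds ratios) and variances.
# In a fixed-effect meta-analysis each study is weighted by the inverse
# of its variance, so large, precise studies dominate the pooled result.
studies = [(-0.45, 0.12), (-0.10, 0.05), (-0.30, 0.20)]  # (log OR, variance)

weights = [1.0 / var for _, var in studies]
pooled = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)
se = math.sqrt(1.0 / sum(weights))  # standard error of the pooled log OR

low, high = pooled - 1.96 * se, pooled + 1.96 * se
print(f"Pooled OR {math.exp(pooled):.2f} "
      f"(95% CI {math.exp(low):.2f} to {math.exp(high):.2f})")
```

Note how the small studies, each inconclusive on its own, contribute to a single pooled estimate with a narrower confidence interval, which is exactly why systematic reviews of disparate small trials can settle questions that no individual trial could.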

Guidelines

Guidelines are another form of filtered or secondary research evidence that are highly useful for implementation in clinical practice and for brief overviews of health topics. Ideally, guidelines are based on a systematic review or meta-analysis, combined with knowledge from clinical practice and patient values, in order to provide evidence-based clinical recommendations. As with other information sources, not all guidelines are produced to the same quality. Some guidelines are created using limited research evidence or inappropriate methodologies. Other issues include lack of transparency and conflicts of interest. Clinical Practice Guidelines We Can Trust is an excellent open access resource for more information about standards for developing guidelines, implementing guidelines in practice, and appraising guidelines for quality and reliability.15 Another useful resource is the AGREE II appraisal tool, which helps users to quickly assess guidelines for quality (www.agreetrust.org/agree-ii/). Guidelines are often published in journals and are available through databases, but many governments and associations also maintain collections of current guidelines for health care professionals (see Box 5-4).

Pre-appraised clinical tools

Pre-appraised clinical tools or point-of-care tools are similar to guidelines in that they aim to provide summaries of evidence and clinical recommendations that are
quick and easy to access even while seeing patients. Most are subscription-based resources such as BMJ Best Practice, Dynamed, and UpToDate and are available through libraries or institutional subscriptions.

Journals and peer review

Many journal articles report directly on the results of research studies and thus are a type of primary or unfiltered evidence, while other journal articles appraise studies, report on changes, etc, and are therefore a type of secondary evidence. When an article is submitted to a peer-reviewed (or refereed) journal, it goes through an evaluation process; the editor sends the article to experts (referees) in the field for critical review before deciding whether to accept it for publication. The peer review process is a key characteristic of scholarly publishing that provides quality control. The articles published in refereed journals should be of high quality because of this process. However, there are no guarantees; the review process is not infallible and depends heavily on how well the reviewers are versed in critical appraisal techniques themselves. Therefore, it is always necessary to read the peer-reviewed literature using the critical appraisal skills described in this book. Some databases allow you to limit the retrieved articles to those appearing in refereed journals (see Box 5-4). Ulrich's Periodicals Directory provides information about journals, including whether a journal is peer reviewed. Journal Citation Reports (JCR) maintains a curated list of journals with citation-based metrics illustrating each journal's impact or reuse within other journals in JCR (Box 5-3). Quality assurance is the goal of peer review, and using information from a peer-reviewed source is intended to provide more reliable and valid information.

Health professionals may also receive many trade journals, which do not have a peer review process and often contain articles that are clinically focused. It is always important to view information critically no matter the source, but because trade journals lack peer review, this caution applies especially to them. Unlike scholarly journals, trade journals do not follow a specific format, and their articles are often shorter and may contain glossy illustrations and images. References are not usually included in trade journal articles, and articles may include opinion or promotional materials.18
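For readers curious how the JCR impact factors reported in Box 5-3 are calculated, the standard two-year journal impact factor is simple arithmetic: citations received in year Y to items the journal published in years Y-1 and Y-2, divided by the number of citable items it published in those two years. A minimal sketch with invented numbers:

```python
def impact_factor(citations: int, citable_items: int) -> float:
    """Two-year journal impact factor for year Y.

    citations: citations received in year Y to items published in Y-1 and Y-2.
    citable_items: articles and reviews the journal published in Y-1 and Y-2.
    """
    return citations / citable_items

# Invented example: 820 citations in 2017 to 152 citable items from
# 2015-2016 gives a 2017 impact factor of about 5.39.
print(round(impact_factor(820, 152), 3))  # 5.395
```

Because both numerator and denominator are small for specialty journals, a handful of citations can move a dental journal several places in a ranking like Box 5-3, which is one reason impact factors should be read as rough indicators rather than precise measures of quality.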

Books

Books can cover a significant amount of material, serving as comprehensive sources of information on topics both broad and narrow in scope. Often the author of a nonfiction book has examined evidence and current knowledge in a subject area, compiling information into a comprehensive, in-depth overview of a topic. However, a drawback of books is their lack of currency; publication of information in book format is usually a lengthy process, and by the time a book is newly published, some of the information may already be several years old. New editions usually are made available only once every few years. Some ebooks, however, are updated more regularly, as are some very highly used print texts.

Grey literature

The term grey literature refers to research produced by organizations like governments, universities, non-governmental organizations, and businesses that is not published through journal databases or traditional book publishers. Although it is tempting to dismiss grey literature because it does not go through a formalized peer review process, significant research is produced by organizations that either need to publish through other routes or have no incentive to publish journal articles. For example, governments and non-governmental organizations typically make their research publicly available directly through their websites rather than through subscription-based journals aimed at academics. Examples of such research might include data from Statistics Canada or reports from the World Health Organization. Grey literature also includes conference proceedings, theses and dissertations, clinical practice guidelines, clinical trials, and clinical study or drug regulatory information.

There are several reasons why it is important to search for grey literature in addition to searching the journal databases. For emerging, small, or practice-based fields, much of the current research may be published on association or institutional websites, shared at conferences, or created as part of a thesis or dissertation. Population-level data, public health topics, or research stemming from public services are all likely to be published online or through reports rather than via scholarly journals. Moreover, only half of clinical trials report their results, meaning that half of the evidence resulting from trials cannot be factored into assessments of effectiveness.19 This is particularly true for negative results, which creates a significant publication bias that has the potential to skew any conclusions stemming from the published research.19 Recent efforts to register trials and report all clinical trial data, as well as efforts to access clinical study reports created by pharmaceutical companies for regulatory agencies, provide a fuller, more accurate, and less biased set of data, particularly related to drugs and devices.

Searching for grey literature can be challenging, but there are many guides available. One such guide is Grey Matters from the Canadian Agency for Drugs and Technologies in Health (CADTH), which provides a list of international grey literature resources.20 Many of the resources listed in Box 5-4 include grey literature. Other sources include databases that index conference abstracts, such as Web of Science or Embase, and institutional repositories and databases that publish theses and dissertations, such as ProQuest Dissertations and Theses Global.


Box 5-3 | Refereed dentistry journals (publication start year) ranked by 2017 journal impact factor16,17

Periodontology 2000 (1993): 6.220
Journal of Dental Research (1919): 5.380
Oral Oncology (1965): 4.636
Clinical Oral Implants Research (1990): 4.305
International Journal of Oral Science (2009)*: 4.138
Journal of Clinical Periodontology (1974): 4.046
Dental Materials (1985): 4.039
Journal of Dentistry (1972): 3.770
Journal of Periodontology (1930): 3.392
Journal of Prosthodontic Research (1957): 3.306
Clinical Implant Dentistry and Related Research (1999): 3.097
International Endodontic Journal (1967): 3.015
Journal of Endodontics (1975): 2.886
Journal of Periodontal Research (1966): 2.878
Molecular Oral Microbiology (1987): 2.853
European Journal of Oral Implantology (2008): 2.809
Journal of the American Dental Association (1913): 2.486
Journal of Evidence-Based Dental Practice (2001): 2.400
Clinical Oral Investigations (1997): 2.386
Journal of Prosthetic Dentistry (1951): 2.347
Oral Diseases (1995): 2.310
Journal of Oral Pathology and Medicine (1972): 2.237
Caries Research (1967): 2.188
International Journal of Oral and Maxillofacial Surgery (1972): 2.164
Operative Dentistry (1958): 2.130
Orthodontics & Craniofacial Research (2002): 2.077
Journal of Oral Rehabilitation (1974): 2.051
Archives of Oral Biology (1959): 2.050
European Journal of Orthodontics (1979): 2.033
Community Dentistry and Oral Epidemiology (1973): 1.992
Journal of Cranio-Maxillofacial Surgery (1973): 1.960
Dentomaxillofacial Radiology (1971): 1.848
American Journal of Orthodontics and Dentofacial Orthopedics (1915): 1.842
Journal of Oral and Maxillofacial Surgery (1943): 1.779
Journal of Prosthodontics-Implant Esthetic and Reconstructive Dentistry (1992): 1.745
International Journal of Computerized Dentistry (1998): 1.725
Oral Surgery Oral Medicine Oral Pathology Oral Radiology (1915): 1.718
Journal of Applied Oral Science (1993)*: 1.709
International Journal of Oral and Maxillofacial Implants (1986): 1.699
Journal of Adhesive Dentistry (1999): 1.691
Medicina Oral Patologia Oral y Cirugia Bucal (2004): 1.671
European Journal of Oral Sciences (1893): 1.655
Korean Journal of Orthodontics (1970): 1.617
Head and Face Medicine (2005)*: 1.606
BMC Oral Health (2001)*: 1.602
Angle Orthodontist (1931)*: 1.592
Journal of Oral and Facial Pain and Headache (1987): 1.538
Journal of Esthetic and Restorative Dentistry (1988): 1.531
Acta Odontologica Scandinavica (1939): 1.522
Australian Dental Journal (1956): 1.494
Odontology (1941): 1.458
Gerodontology (1982): 1.439
Journal of Public Health Dentistry (1941): 1.436
Dental Traumatology (1985): 1.414
International Dental Journal (1950): 1.389
International Journal of Paediatric Dentistry (1991): 1.383
International Journal of Dental Hygiene (2003): 1.380
Australian Endodontic Journal (1972): 1.371
Oral and Maxillofacial Surgery Clinics of North America (1989): 1.367
European Journal of Dental Education (1997): 1.343
International Journal of Prosthodontics (1988): 1.333
Implant Dentistry (1992): 1.307
British Dental Journal (1872): 1.274
Cleft Palate-Craniofacial Journal (1964): 1.262
British Journal of Oral and Maxillofacial Surgery (1963): 1.260
Progress in Orthodontics (2003)*: 1.250
International Journal of Periodontics & Restorative Dentistry (1981): 1.249
Brazilian Oral Research (1986)*: 1.223
Journal of Oral Implantology (1970): 1.212
Dental Materials Journal (1982): 1.205
Journal of Advanced Prosthodontics (2009): 1.144
Cranio-The Journal of Craniomandibular Practice (1982): 1.094
Quintessence International (1970): 1.088
Journal of Dental Education (1936): 1.085
Journal of Periodontal and Implant Science (1971): 1.072
Journal of the Canadian Dental Association (1998): 0.978
Oral Health and Preventive Dentistry (2003): 0.960
Community Dental Health (1984): 0.956
Journal of Orofacial Orthopedics-Fortschritte der Kieferorthopadie (1931): 0.907
European Journal of Paediatric Dentistry (1990): 0.893
Journal of Clinical Pediatric Dentistry (1977): 0.854
Journal of Oral Science (1958): 0.853
Swedish Dental Journal (1977) (formerly Odontologisk Revy, 1908): 0.818
American Journal of Dentistry (1988): 0.760
Journal of Dental Sciences (2006)*: 0.619
Seminars in Orthodontics (1995): 0.500
Oral Radiology (1985): 0.466
Journal of Stomatology Oral and Maxillofacial Surgery (formerly Revue de Stomatologie de Chirurgie Maxillo-Faciale et de Chirurgie Orale) (1894): 0.411
Australian Orthodontic Journal (1967): 0.396
Implantologie (1993): 0.138

* Open access


Box 5-4 | Publicly available resources for finding and accessing research

Databases and collections
• PubMed: www.ncbi.nlm.nih.gov/pubmed
• Cochrane Library: www.cochranelibrary.com
• TRIP: www.tripdatabase.com
• CADTH Rapid Reviews: www.cadth.ca
• ACCESSSS from McMaster University: www.accessss.org
• NICE Evidence Search: www.evidence.nhs.uk

Search engines
• Google Scholar: scholar.google.com
• Advanced Google: www.google.com/advanced_search

Guidelines
• Canadian Medical Association (CMA) Infobase: joulecma.ca/cpg/homepage
• National Guideline Clearinghouse: www.ahrq.gov/gam

Clinical trials
• ClinicalTrials.gov: clinicaltrials.gov
• World Health Organization International Clinical Trials Registry Platform: apps.who.int/trialsearch
• OpenTrials: explorer.opentrials.net

EBD websites
• ADA Center for Evidence-Based Dentistry: ebd.ada.org/en
• Centre for Evidence-Based Dentistry (CEBD): www.cebd.org
• American Academy of Pediatric Dentistry: www.aapd.org/research/evidence-based-dentistry
• Dental Elf blog: www.nationalelfservice.net/dentistry

Canadian oral health sources
• Canadian Oral Health Roundtable Clearinghouse: www.oralhealthroundtable.ca
• Canadian Best Practices Portal: cbpp-pcpe.phac-aspc.gc.ca
• Canadian Oral Health Reports: www.caphd.ca/canadian-oral-health-reports


Open access, open science, and open data

"Open access (OA) literature is digital, online, free of charge, and free of most copyright and licensing restrictions." —Peter Suber19

Open access has a specific definition and is distinct from subscription-based resources. In this book, we have tried to distinguish between those resources that require subscriptions and those that are freely available for anyone to access and use. Open access increases the reproducibility of scientific research, encourages collaboration and faster feedback for authors, increases the impact and reuse of research, and ensures that users are able to access research irrespective of institutional collections. This is particularly important for those who work in private practice, are affiliated with hospitals or other institutions with small collections, or are from countries with less funding for research subscriptions.

Open access journal publishing models are well established and provide free access to scholarly publishing. Examples of such initiatives are the Public Library of Science (www.plos.org) and BioMed Central (www.biomedcentral.com). There are many open access publishing models, including scholarly journals that are entirely open access, those that provide access after a delay (often 6 months or a year), and subscription-based journals with an option for authors to pay to make their article open access. Another open access model is commonly referred to as green open access; it involves institutional- or subject-based repositories where authors can deposit a copy of their journal article either pre– or post–peer review. Often these repositories will also accept other scholarly research outputs, such as books, video recordings, presentations, tools, podcasts, and data, in order to make them easily findable and accessible to the public.


Box 5-5 | Open access directories
• Directory of Open Access Journals (DOAJ): doaj.org
• Directory of Open Access Repositories (OpenDOAR): www.opendoar.org
• SHERPA RoMEO (searchable database of journal titles with information about their OA policies): www.sherpa.ac.uk/romeo/index.php
• SHERPA Juliet (searchable database of research funders with information about their OA policies): v2.sherpa.ac.uk/juliet/

Most universities maintain institutional repositories to preserve and make available research by their faculty and students. An example of a subject-based repository is PubMed Central (www.ncbi.nlm.nih.gov/pmc). Open access publishing is entirely compatible with the peer review process, and it is very common for authors to publish their work in a subscription-based journal and deposit a copy of the peer-reviewed article in an open access repository as well. Most publicly funded grant agencies now require researchers to make publications stemming from their work open access. This ensures that publicly funded research is available to the public.

There is also an increasing push toward open science and open data, with many research organizations and funding bodies now requiring that more of the scientific process and research data be made open access. For open science, this means making scientific research more transparent and reproducible by publishing research protocols and details on methods and tools. It also means making data publicly available where possible so that other researchers can verify and reproduce results, and so that data can be reused. An example is the Canadian Institutes of Health Research, which has adopted initiatives and policies related to managing and preserving research data.

Open access research can generally be found by searching online. Sometimes it can help to search within a specific open access repository (Box 5-5) or journal in order to narrow down your search, but general searches using Google or other search engines work as well. If the article that you need is not open access or available to you through your institution's subscriptions, you can place an interlibrary loan request through a library that you are affiliated with or visit a library with a subscription to see if you can access it within the library.

Because many open access journals charge author publication fees, one issue that has arisen is predatory open access publishing. Predatory publishers charge authors to publish in their journals without providing appropriate peer review, editing, and other services to ensure the quality of the research they publish. Given the number of journals being produced, it is difficult to assess, identify, and list predatory journals, especially as some journals may begin as poorer-quality publications and gradually improve, or vice versa. However, there are ways to identify predatory journals. For example, they tend to charge large publication fees; solicit articles from researchers, often via email; and provide little or no peer review. To check if a journal is reputable, look at its publishing processes and ensure that they are transparent and clear. Also, check the Directory of Open Access Journals and reputable database indexes, such as Medline, CINAHL, and Embase, to see if the journal title appears on the list of journals they cover. Similarly, you can see if the journal is included in JCR. If you are unsure or have questions, you can also speak to a librarian.

Citation Analysis and Research Impact

Citation analysis is a method of tracking the evolution of research as one author's idea is accessed and built upon by another. It can also be used to track controversy in science as works are cited in scholarly discussions and disputes. Resources for citation tracking include Google Scholar, Scopus, and Web of Science. A search within Web of Science for this book's author—"Brunette D*"—retrieves 205 results, which can be further refined to remove authors with similar names, such as "Doug D Brunette" and "Deborah Brunette," and authors with the same name but who work in different fields or institutions. In this case, the resulting list contains 168 articles that are likely to be authored by Don Brunette. Selecting the "Times Cited" number for a specific article reveals works that have cited the article, while the "Analyze Results" and "Create Citation Reports" features provide more in-depth citation metrics on each article and the author's works as a whole.


Fig 5-3 Partial results of a search for articles authored by Don Brunette in 2001 using Web of Science. (Accessed 1 March 2019.)

A search in the Google Scholar Advanced Search for "DM Brunette" retrieved 291 records. Some of the results from Google Scholar overlap with those found through Web of Science, but others are unique. Not only are different articles retrieved, but a look at the citation analyses for articles common to both searches reveals that the subsequent citation information also differs. For example, Web of Science showed the 2001 article published in the International Journal of Oral and Maxillofacial Implants as being cited 70 times (Fig 5-3), while Google Scholar reported it as being cited 106 times. Web of Science and Google Scholar draw on different sets of articles, and therefore an exhaustive search would need to include both. A search in Scopus would also likely reveal different results. The use of citation analysis in the evaluation of papers and journals is addressed in chapter 22.

Following the chain of research by looking at both the reference lists included in articles and books, as well as the citing research shown in Web of Science, Google Scholar, and Scopus, is a very useful way of finding research on a given topic, particularly when a topic proves difficult or elusive to search.

Citation analysis is also used to measure the impact of journals. There are many different formulas for calculating journal impact metrics, but the most well-known is the journal impact factor (IF):

IF = (total no. of citations received in a given year by the articles a journal published in the previous 2 years) / (total no. of articles the journal published in the previous 2 years)

For example, the journal Periodontology 2000 has a 2017 IF of 6.220 (see Box 5-3). This means that articles published in the journal in 2015 and 2016 were cited an average of 6.22 times each during 2017. Impact factors can be found in the JCR available through Web of Science, but there are many other journal citation metrics available as well, including metrics produced by Google Scholar.

One of the most popular author impact metrics is the h-index, which aims to provide a succinct measurement of an author's research impact by weighing the number of articles published against the number of citations they have received. The h-index is the number (h) of articles that have been cited at least h times. For example, if a researcher has an h-index of 5, they have five articles that have each been cited at least five times. The h-index can be useful for comparing researchers in the same field who have had similar periods of research productivity. Google Scholar, Scopus, and Web of Science all provide h-index metrics for authors based on the articles included in their sources.

However, as with other citation metrics, the h-index has a number of biases and flaws. One drawback is that the h-index provided by any database will exclude research or citations that are not included in the journals it covers. This is particularly important for fields or types of research that are not well covered by that database or tool. Also, because these tools cover English-language journals from Western countries more heavily, research produced in other languages or journals, as well as research published as books, conference proceedings, and grey literature or in fields not covered by the database, is not included in the h-index. Similarly, although the h-index is often used to compare researchers, it is not useful for comparing researchers from different fields, because each field varies in terms of size and publishing patterns, with some being smaller or more focused on clinical applications or public health. The h-index also represents research over the course of a publishing career, and researchers who have had longer periods of research productivity or who are in fields with high numbers of citations generally will have much higher h-indexes. Higher-impact journals are also often cited at higher rates, but these journals tend to have fewer articles that result from a peer-reviewed submission process and more solicited content, meaning that they tend to feature works from well-established researchers known to the journal or editorial board. H-indexes also tend to be inflated in fields with higher levels of self-citation or with many co-authors per paper.

While citation analysis shows the number of citations of an article or author within the set of publications covered by the tool, alternative metrics or altmetrics track other online indications of impact or reuse. For example, altmetrics can track the number of downloads or online reads, the number of online media or social media mentions, as well as feedback and comments. These metrics are often used in open access repositories to track impact and are increasingly being used in journal databases as well. Authors can also embed them in online portfolios. They are particularly helpful for showing the impact of work among academic online networks and among those who do not tend to publish in scholarly journals, including clinicians, government bodies, private businesses, and patients.

Every type of metric has limits. For example, journal citation metrics typically are not comparable between fields and focus on Western, English-language journals. Author-level citation-based metrics and altmetrics may include self-citations or citations that were negative (Andrew Wakefield's retracted papers are a good example of papers with high levels of negative citations). It is also important to distinguish between different authors who have the same name by checking their affiliations and subject areas. As a result, it is always useful to consider the field, use a variety of metrics when possible, and also include more qualitative analyses of impact.
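The two metrics just described can be computed with a few lines of code. The sketch below is a minimal illustration of the arithmetic only; the citation counts are invented for the example, and real values come from curated sources such as JCR, Scopus, or Google Scholar.

```python
def impact_factor(citations_received, articles_published):
    """Impact factor for year Y: citations received in Y by articles
    published in Y-1 and Y-2, divided by the number of those articles."""
    return citations_received / articles_published

def h_index(citation_counts):
    """Largest h such that the author has h articles each cited >= h times."""
    h = 0
    for rank, cites in enumerate(sorted(citation_counts, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Invented numbers, chosen to reproduce the 6.22 average from the text.
print(round(impact_factor(1244, 200), 3))  # 6.22
print(h_index([10, 8, 6, 5, 5, 2, 1]))     # 5: five articles cited at least 5 times
```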

Profiles

Another way that researchers develop networks, promote their research, and demonstrate impact is through online profiles. These can include profiles on institutional web pages but also academic and professional profiles maintained by external companies. Platforms like Academia.edu, ResearchGate, and LinkedIn have become very popular in recent years. Another option is ORCID (orcid.org), a nonprofit organization supported by member institutions. ORCID provides a unique persistent digital identification number that distinguishes authors from one another. Authors can build a profile and publication list within ORCID and can link ORCID with their other websites and profiles, as well as add their ORCID IDs to articles. Many universities, publishers, agencies, and other organizations now require ORCID IDs. One thing to keep in mind when using these profiles is that they are online identifiers and portfolios rather than publication vehicles. Therefore, it is still necessary to publish works within traditional publications and open access repositories and then use these services as a means to link to and promote your work.

Conducting a Search

Defining the question

The first and probably most difficult step in finding information is to define the question. When beginning a research project, many people find that initially they only have a vague idea of what they are looking for, which often results in hours spent searching online and retrieving thousands of results—none of which provides the appropriate answer. Moreover, clarifying the question is an important step, because different types of questions require different resources. PICO is a well-developed model that helps refine clinical questions so that they can be translated into an effective and focused search strategy. PICO worksheets are available online from many health libraries.

P = Patient or problem
I = Intervention or exposure
C = Comparison (if relevant)
O = Outcome(s)


For example, if someone wanted to know whether soft drinks consumed by schoolchildren increase tooth decay, using PICO to identify the key concepts in the question would look like this:

P = schoolchildren
I = soft drinks
C = not applicable
O = rate of tooth decay

The resulting question can be phrased as: Does consumption of soft drinks increase the rate of tooth decay in schoolchildren? Where should the researcher look first to find the answer to this question? A website sponsored by soft drink manufacturers? Probably not. Looking for a systematic review of the literature on this question or for research articles on this topic would be more effective.

Choosing a resource

As mentioned earlier, defining the question guides the choice of resources. For finding information on the question "What is the blood supply to the mandible?" a book or atlas on anatomy will be most useful, whereas a question about antibiotic prophylaxis may be covered by a clinical practice guideline. Researchers with a specific patient-focused question might need to consult the journal literature to determine if there are relevant studies that address the question. If a book is the best source of information on the topic, searching a library's general search and limiting to books is a good approach. The book record will indicate whether the library owns the book and where it is located, including whether it is available online. Searching for articles related to a research question or topic is usually a two-step process: (1) searching the library's general search engine or within an article database using the most relevant PICO concepts from the question and (2) locating the relevant full-text articles. Often users can link directly to the full text of an article, but sometimes users may need to look up the journal in their library's catalog or request the article from another institution (interlibrary loan).

Databases

Most libraries have online guides that suggest article databases for each research area. Each database specializes in a particular subset of the literature and can help you to find relevant information on your topic quickly. A typical dentistry research guide will direct researchers to specialized health indexes, such as PubMed, Embase, CINAHL, the Cochrane databases, ACP Journal Club, and PsycINFO, as well as more general resources such as Web of Science and Google Scholar, depending on the library's holdings. PubMed and Google Scholar are both freely available online.

Search techniques

The following section covers search techniques that can be used in most databases as well as specific search strategies for PubMed and Google Scholar, two freely accessible and widely used resources. When searching, it is good to balance comprehensiveness with precision. If your goal is to find a specific article or only a few articles, then it is preferable to create a precise search that finds only what you want. However, if your goal is to review the evidence for a particular treatment, then your search must be more comprehensive, as a more precise search will exclude research that could be relevant. Often we look for and find the research that we expect to find, because our search terms and strategies mirror the information that we already have. A more comprehensive approach to searching can help to add other voices and evidence that may change our perspective. One technique to broaden a search is to search multiple resources, including databases and grey literature sources. All of the following search tips can be used to broaden or narrow a search depending on your needs and the features available in the resource you are using.

Keyword and subject heading searching

In most databases, keyword or text searching only searches the article record rather than the full text of the article. Most of the time this means that when you search for a keyword, you are searching for terms in the title, abstract, author-added keywords, and perhaps a few other fields. With this type of search, the database searches only the actual word that you entered. A keyword search for something as seemingly straightforward as "teeth" would need to include not only the word teeth but also the words tooth, incisor, incisors, premolar, premolars, molar, molars, etc. Searching for "teeth" as a keyword will return only results that have that precise word. Thus, it is important to include all the terms that make up the concept—as well as singular and plural forms—and any different spellings (eg, pediatric/paediatric). For some topics, it can be time consuming to form a list of all possible relevant terms. Another drawback is that, because the search retrieves anything with the term(s) entered, many results may contain the term(s) but may not be relevant. For example, a keyword search for information on aspirin would retrieve articles authored by John Aspirin, even though such articles may have nothing to do with the subject aspirin.

Some databases also offer subject heading searching. One example is Medical Subject Headings (MeSH), produced by the US National Library of Medicine and used for indexing journal articles in Medline. Many other health databases, such as Embase, CINAHL, and PsycINFO, have their own subject headings. When you search for a subject heading, you first search a controlled list of terms (the subject headings) and then tell the database to return all of the articles with that subject heading added to them. For Medline, the database's indexers look at each article and apply subject headings appropriate to the article's content. This means that under the subject heading "myocardial infarction," all the articles with that heading will be about myocardial infarction, even if the author has used the term "heart attack" or "MI" instead.

The power of a keyword approach is its ability to comb through databases, looking for the occurrences of words, while the benefit of using a subject heading approach is that the search results are highly relevant. Developing an effective and complete keyword search can be difficult, and it may return irrelevant results. A drawback of searching using subject headings is that you may not find a suitable subject heading for your topic, or you may miss new articles if subject headings have not yet been added to the article record (Fig 5-4).

There are techniques for searching by keyword or subject headings that allow a researcher to manage and combine results. Some common techniques, such as Boolean operators, truncation, and phrases, are described later.

Finding subject headings

In databases with subject headings, most will offer a search feature that allows you to check if there is a subject heading on your topic. For example, on PubMed's home page you can link directly to the database of MeSH, and Medline Ovid has a button that allows you to search or "map" your search term to a subject heading. Often each subject heading comes with a scope note or definition that you can use to ensure that it fits the concept you are looking for. It is also helpful to look at where the subject heading is situated in relation to other terms. Often broader terms and narrower terms will be shown when you click on the subject heading, and you will have the option to select "Explode" in order to include all the narrower terms. The following example shows how the subject heading "beverage" is broken down further into narrower subject headings for articles on particular types of beverages:

Beverage
  Alcoholic beverages
  Carbonated beverages
  Coffee
  Milk
  Milk substitutes
  Mineral waters
  Tea

Once you have chosen your subject heading, the database might also allow you to choose a subheading, which can help you to narrow down the term to only a particular aspect of the concept, such as economics or therapy. This can be helpful if your search retrieved too many results that were not specific to your topic. To broaden your search, include all of the subheadings.
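A minimal sketch of what "Explode" does is shown below; the hierarchy is simply the beverage example above, hard-coded for demonstration, whereas real subject-heading trees are much larger and deeper.

```python
# Toy subject-heading hierarchy from the beverage example above.
NARROWER = {
    "Beverage": ["Alcoholic beverages", "Carbonated beverages", "Coffee",
                 "Milk", "Milk substitutes", "Mineral waters", "Tea"],
}

def explode(heading):
    """Return the heading plus all narrower terms beneath it, recursively."""
    terms = [heading]
    for narrow in NARROWER.get(heading, []):
        terms.extend(explode(narrow))
    return terms

print(explode("Carbonated beverages"))  # just the heading itself
print(explode("Beverage"))              # the heading plus its seven narrower terms
```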

Boolean operators

The Boolean operators OR, AND, and NOT may be familiar to many readers. These simple words are powerful tools in the search for information. The term OR is commonly used to combine synonyms. As mentioned earlier, finding information about a particular concept using a keyword approach would require combining several terms in order to capture all the relevant information. For the question in the previous PICO example ("Does consumption of soft drinks increase the rate of tooth decay in schoolchildren?"), which includes more than one concept, each concept is built by including all synonyms and forms of the words. For the concept tooth decay, the keyword search might look something like this:

Concept A = decay OR decayed OR decays OR caries OR carious . . .


Fig 5-4 Article record in PubMed, with the fields frequently searched by keywords and the MeSH headings indicated.

This retrieves results with any of these terms (Fig 5-5). For the concept soft drinks, the keyword search might look something like this:

Concept B = soft drink OR soft drinks OR carbonated drink OR carbonated drinks OR carbonated beverages OR carbonated beverage OR pop OR cola OR colas . . .

Once the major concepts are built, they must be combined with the operator AND. Using AND will tell the database to return only results that include one or more terms from each of your concepts (Fig 5-6).

Concept A (tooth decay, etc) AND Concept B (soft drinks, etc)

NOT is a term used to exclude results containing a specified term. This operator should be used with caution, because it is easy to inadvertently exclude relevant information.
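The behavior of these three operators can be pictured as set operations on the records each term retrieves. The following minimal sketch uses invented record IDs purely for illustration:

```python
# Hypothetical sets of record IDs retrieved by three separate searches.
decay = {1, 2, 3, 5}
caries = {2, 3, 4, 6}
children = {3, 4, 5, 9}

concept_a = decay | caries          # OR = union: {1, 2, 3, 4, 5, 6}
narrowed = concept_a & children     # AND = intersection: {3, 4, 5}
excluded = concept_a - children     # NOT = difference: {1, 2, 6}

print(concept_a, narrowed, excluded)
```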

Truncation

Many databases or library searches allow you to place a symbol at the end of a word, which tells the database to retrieve the word stem with different endings. A common symbol used for truncation is the asterisk (*). Most databases have a "help" or "search tips" link that explains which symbols the system uses. So, for Concept A (tooth decay), the search might look like this:

Concept A = decay* OR caries OR carious


Fig 5-5 Keyword searches using OR will retrieve sources that have any of the search terms.

Fig 5-6 Keyword searches using AND will retrieve sources that contain both search terms.

Note that enough of the word must be used to make the results meaningful; a search on cari* would return too many unrelated terms, such as Caribbean or caring. Concept B's search might look like this:

Concept B = soft drink* OR carbonated drink* OR carbonated beverage* OR pop OR cola OR colas . . .

Phrases

Sometimes it is more effective to search for a specific phrase using quotation marks (eg, "tooth decay"). Exact functionality depends on how the particular search engine operates, but generally the use of quotation marks will limit the results of the search to sources that contain the exact phrase. For example, if you search for "tooth decay" as a phrase with quotation marks, this search will not retrieve articles that state, "the tooth was decayed."

Parentheses

Finally, parentheses are used to appropriately group the terms and operators to control the order of the search.

(Concept A) AND (Concept B)
(decay* OR caries OR carious) AND (soft drink* OR carbonated drink* OR carbonated beverage* OR pop OR cola OR colas)

Combining subject headings and keywords

In most databases, it is best to build a comprehensive search by searching for one part of your question, or concept, at a time. To do this, search for the relevant subject heading(s), and then search using the relevant keyword(s). Combine the subject heading(s) and keyword(s) using the Boolean OR. When the first concept is complete, search for the second concept by finding any relevant subject heading(s) for the second concept and then searching for keyword(s), then combining the two with the Boolean OR. Once you have a full search for each concept, you can add them together with AND. Below is an example of searching each concept separately and combining using the Boolean OR. Each line below is a separate search, which the database lists and numbers as you execute them.

1. dental caries/ [MeSH]
2. decay* OR caries OR carious
3. 1 OR 2
4. Carbonated Beverages/ [MeSH]
5. soft drink* OR carbonated drink* OR carbonated beverage* OR pop OR cola OR colas
6. 4 OR 5
7. 3 AND 6

Searching PubMed

While most databases use many of the previous features, PubMed and Google Scholar automate some of the search process, and therefore search strategies for these resources look a bit different. When you search within PubMed, the database automatically searches for keywords and subject headings (MeSH). It also automatically explodes subject headings. It might seem simple when you type a term in the search box, but PubMed is doing a lot of work interpreting your search behind the scenes. Most of the time, PubMed does an excellent job interpreting search terms, but sometimes you may need to modify the search. In order to see how PubMed is interpreting your search, open up the "Search details" box located on the right of the search results page (Fig 5-7). If there are any terms that are missing or terms that PubMed has added that are unlikely to help your search, then you may need to adjust the search.

To narrow your search by publication type, date range, population age, etc, use the filters or limits located on the left of the search results screen. The "Customize" and "Show additional filters" links allow you to access more options so that you can refine your search using the full range of tools. For example, to find only systematic reviews and meta-analyses on your topic, click the "Customize" link under "Article types," add "Systematic reviews" and "Meta-analysis" from the drop-down menu, and then click each desired limit or filter. For some searches, you may want to build your search concept by concept using the advanced search page (see the link under the main search bar).

To save time, maintain a record of useful research, and stay up to date, create an NCBI account. Once you have an account, you can save articles to folders (called "Collections" in PubMed), save searches, and create alerts and RSS feeds so that the database will automatically run your search and send you any new results via email or RSS. You can also export search results to a citation manager, such as RefWorks, EndNote, Zotero, or Mendeley.
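For readers who script their searches, the worked strategy above can also be submitted to PubMed's public E-utilities interface (esearch). This is a minimal sketch rather than an official client: [MeSH Terms] and [Title/Abstract] are standard PubMed field tags, the term lists simply mirror the example in this section, and heavy or automated use normally calls for registering an NCBI API key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

concept_a = ('"dental caries"[MeSH Terms] OR decay*[Title/Abstract] '
             'OR caries[Title/Abstract] OR carious[Title/Abstract]')
concept_b = ('"carbonated beverages"[MeSH Terms] OR "soft drink"[Title/Abstract] '
             'OR "soft drinks"[Title/Abstract] OR cola*[Title/Abstract]')
query = f"({concept_a}) AND ({concept_b})"

params = urlencode({"db": "pubmed", "term": query, "retmode": "json", "retmax": 20})
with urlopen(f"{ESEARCH}?{params}") as response:
    result = json.load(response)["esearchresult"]

print("Records found:", result["count"])  # total matching PMIDs
print("First PMIDs:", result["idlist"])   # up to retmax IDs for further fetching
```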

Searching Google Scholar

Google Scholar is similar to PubMed in that it automates much of the searching process. This can be convenient, but it also means that you have less control. Google Scholar is a search engine that can help to find literature on a wide range of topics. It does not show how it searches and tends to retrieve different results for different users depending on factors like their previous search history and geographic location. As a result, searches in Google Scholar are not reproducible. In addition, while sometimes the results you need will be right at the top of the results list, often the best research is buried very deeply in the search results. It is also important to note that the number of results listed at the top of the screen is only an estimate, and Google Scholar only shows the first 1,000 results (eg, in the following search Google Scholar estimates that there are over 30,000 results, but that number changes as you scroll through the results).21 However, Google Scholar has some very useful features for finding research (Fig 5-8) and is particularly useful for finding grey literature.

When conducting a search in Google Scholar, keep it simple! Most of the search techniques described earlier, such as Boolean operators and truncation, work in databases but do not work in Google Scholar.22 You can also limit a search by date using the options on the left of the screen. That being said, there are also more advanced commands that you can use to narrow your search. One example is "allintitle:", which is used to search for terms only in titles. Another is "author:", which is used to search for author names. See the two examples below:

allintitle: systematic review dental caries
author: "dm brunette"

Fig 5-7 PubMed features.


Fig 5-8 Google Scholar features. Note: Google Scholar frequently updates its appearance and layout, so the exact location and appearance of these tools may vary over time.

Fig 5-9 Google Scholar metrics. Open the menu, click on “Metrics,” and select “View Top Publications” to browse by journal title or category. To learn more about the metrics click on “Learn More.”

For other options, use Advanced Search (see link in the menu in Fig 5-9) or search for a library guide on advanced Google Scholar commands. Additional tools to make research easier are found in Google Scholar's settings (shown in Fig 5-9 and sometimes also shown as a gear icon). You can change the settings to link Google Scholar to your library's collections, which can help you access subscription-only or paywalled research. You can also link directly to a bibliographic manager such as Zotero or RefWorks to easily export citations.

As mentioned earlier, Google Scholar provides citation tracking and journal metrics (see Fig 5-9). It allows users to create profiles and lists of their works, which can be helpful for increasing research impact and maintaining an online portfolio. It also shows article citations and links to the citing articles in order to follow the scholarly conversation. Citation tracking can be an excellent way of finding similar research, particularly on topics that are difficult to search. Google Scholar is useful for finding grey literature, as it searches the sites of some organizations that publish reports, data sets, and other works beyond just journal articles. Citation tracking using the "Cited by" links can help to identify grey literature used by other authors working in the same field.


Fig 5-10 Advanced Google searching. Using the site or domain search and limiting the results to PDFs results in a search of UK government websites for PDF documents related to fluoride. Adding additional search terms would further refine the search.

Advanced Google

Advanced Google searches web content but offers far more options than the basic search. For those conducting research, it is especially helpful for searching within specific websites or domains (Fig 5-10).

• Site or domain: Enter the domain of a website that you would like to search and then add keywords in one or more of the search boxes. This feature is useful for searching organizations that produce and publish high-quality research or information on their websites.
• File type: Change the file type to the type most likely to have the information you are looking for. For example, limiting the search to PDF retrieves mostly reports.

When searching within Google, it is good to keep in mind that you are not searching a curated list of scholarly resources, so it is important to use the critical appraisal tools and strategies described in this book.
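The same restrictions can also be typed directly into an ordinary Google search using the standard site: and filetype: operators. For example, a query along the lines of fluoride site:gov.uk filetype:pdf should reproduce the search shown in Fig 5-10; the domain here is an illustrative choice, and operator behavior may change as Google evolves.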

Final Search Tips

• Define your question first and use PICO format for clinical questions.
• Understand how research is published and select resources based on which ones will best answer your question, which are available to you, and the scope of your research (How much time do you have? How thorough do you need to be?).
• When searching databases, select the appropriate keywords and subject headings for your key PICO elements, whenever possible.
• Apply limits at the very end of your search.
• Verify results are relevant to your research question and revise as needed.
• Track what you have done and what worked.
• Look at citations to find more relevant articles.
• Appraise the research for accuracy, reliability, and bias.


Although there is a large amount of information available, understanding what resources you have access to and how information is organized can help when selecting appropriate resources. Choosing the right resource and using the full range of search features will help you to find the best research evidence for your question quickly and efficiently. Moreover, tools such as saving and tracking searches, creating alerts, and saving articles will help you to track important research evidence and make evidence-based clinical decisions.

Acknowledgment

A special thank you to Kathy Hornby, MLS, DMD, who authored the previous edition of this chapter and provided many of the examples and text.

References

1. Doyle AC. The Adventure of the Copper Beeches, Lit2Go edition. https://etc.usf.edu/lit2go/32/the-adventures-of-sherlock-holmes/363/adventure-12-the-adventure-of-the-copper-beeches. Accessed 5 June 2019.
2. American Dental Association. About EBD. https://ebd.ada.org/en/about. Accessed 1 June 2018.
3. Richards D, Clarkson J, Matthews D, Niederman R. Evidence-Based Dentistry: Managing Information for Better Practice. London: Quintessence, 2008.
4. Clarkson J, Harrison JE, Ismail AI, Needleman I, Worthington H. Evidence Based Dentistry for Effective Practice. London: Martin Dunitz, 2003.
5. American Dental Association Center for Evidence-Based Dentistry website. https://ebd.ada.org/en. Accessed 1 June 2018.
6. Centre for Evidence-Based Dentistry website. https://www.cebd.org. Accessed 1 June 2018.
7. American Academy of Pediatric Dentistry. Evidence-Based Dentistry. https://www.aapd.org/research/evidence-based-dentistry. Accessed 1 June 2018.
8. Beck C, Zagar S. [Pyramid of evidence] [Image]. Created using the EBM Generator, Trustees of Dartmouth College and Yale University; produced by Glover J, Izzo D, Odato K, Wang L.
9. Resnik RR, Misch C. Prophylactic antibiotic regimens in oral implantology: Rationale and protocol. Implant Dent 2008;17:142–150.
10. Dent CD, Olson JW, Farish SE, et al. The influence of preoperative antibiotics on success of endosseous implants up to and including stage II surgery: A study of 2,641 implants. J Oral Maxillofac Surg 1997;55(suppl 5):19–24.
11. Laskin DM, Dent CD, Morris HF, Ochi S, Olson JW. The influence of preoperative antibiotics on success of endosseous implants at 36 months. Ann Periodontol 2000;5:166–174.
12. Schwartz AB, Larson EL. Antibiotic prophylaxis and postoperative complications after tooth extraction and implant placement: A review of the literature. J Dent 2007;35:881–888.
13. Esposito M, Grusovin MG, Worthington HV. Interventions for replacing missing teeth: Antibiotics at dental implant placement to prevent complications. Cochrane Database Syst Rev 2013;7:CD004152.
14. Cochrane Collaboration. The difference we make. https://www.cochrane.org/about-us/difference-we-make. Accessed 1 July 2018.
15. Institute of Medicine. Clinical Practice Guidelines We Can Trust. Washington, DC: National Academies, 2011.
16. University of British Columbia Library. Evaluating Information Sources. http://guides.library.ubc.ca/EvaluatingSources. Accessed 1 August 2018.
17. All Trials. Half of all clinical trials have never reported results. http://www.alltrials.net/news/half-of-all-trials-unreported. Accessed 1 June 2018.
18. Canadian Agency for Drugs and Technologies in Health. Grey matters: A practical tool for searching health-related grey literature. https://www.cadth.ca/resources/finding-evidence/grey-matters. Accessed 1 June 2018.
19. Suber P. Open Access. Cambridge: MIT, 2012.
20. Clarivate. Journal Citation Reports. https://clarivate.com/products/journal-citation-reports. Accessed 1 June 2018.
21. Ulrichsweb Global Serials Directory. Login page. http://ulrichsweb.serialssolutions.com. Accessed 1 June 2018.
22. Bramer WM, Giustini D, Kramer BM, Anderson P. The comparative recall of Google Scholar versus PubMed in identical searches for biomedical systematic reviews: A review of searches used in systematic reviews. Syst Rev 2013;2:115.


6 Logic: The Basics

"The Reverend Sydney Smith, a famous wit, was walking with a friend through the extremely narrow streets of old Edinburgh when they heard a furious altercation between two housewives from high-up windows across the street. 'They can never agree,' said Smith to his companion, 'for they are arguing from different premises.'" PETER B. MEDAWAR1

Logical analysis concerns the relationship between a conclusion and the evidence used to support it2 and has obvious relevance to the evaluation of scientific papers. An argument is the expression of logical analysis and consists of both a conclusion and supporting evidence. The statements of evidence are premises, which are ideally statements of fact (ie, a datum of experience known to be true because it can be verified by others). Typically, however, because of the paucity of well-established facts, a less stringent criterion for the premises is used, namely, that they at least be plausible. Often in deductive arguments, the statements may be principles or ideals, such as "All men are created equal." Premises may fail to support the conclusion in one of two ways: (1) the facts are implausible or not true, or (2) the facts are not appropriately related to the conclusion. The relationship between the premises and the conclusion is the domain of logic.

Some Basic Standards and Ground Rules

In assessing an argument, several ground rules apply:

1. Rationality: Those making an argument are expected to have reasons for their beliefs.
2. Consistency: Where two cases are similar, a reason for not treating them the same way must be offered.
3. Open-mindedness: The evaluator must be open-minded, that is, prepared to consider other points of view and, when evaluating logic, prepared to reason from premises with which one disagrees.
4. Balance: The evaluator should be concerned only with defects in proportion to the degree that they affect the conclusions. If the major conclusions are sound, there is little point in nitpicking a paper with minor criticisms.

In analyzing logic, it is often necessary to clarify the issues and to analyze the arguments underlying the positions, which may be complex, having several arguments linked together. For each argument, it is important to identify the conclusions and the underlying evidence.


This task is sometimes made difficult because of the convolutions of scientific writing. In addition, the terms of an argument must be used precisely. The measure of the completeness of an analysis depends on the number of valid and significant distinctions contained in the works being compared.3 The number of distinct terms is related to the number of distinctions. For example, early work in connective tissue spoke of fibers; subsequently, some fibers were called collagen; today no less than 28 types of collagen are distinguished. Thus, as the analysis of fibers has become more complete, the descriptive terms have multiplied.

Fallacy of no evidence: Assertions

Fallacies are mistakes in reasoning, and they are made in various ways, such as emotional reasoning, invalid reasoning, and lack of precision in the use of words. A popular academic sport over the centuries has been naming fallacies and compiling long lists of them. Knowing the names and nature of some fallacies is useful in arguments because they enable one to quickly recognize and attack weaknesses in an opponent's argument, as occurs, for example, in a court of law or in the discussion section of a scientific paper.

A common fallacy is the fallacy of no evidence, where some statements are not supported by evidence. Statements not supported by evidence are called assertions. Surprisingly, the simple technique of repeating a statement often causes people to believe it, particularly when the speaker has an air of authority. Although many simple assertions are obvious (eg, those made in radio or television commercials), others are more insidious. Statements of belief, apparently unsupported by evidence, can sometimes be found in the discussion section of scientific papers. The best way to deal with assertions is simply to challenge them by asking what evidence supports them. This strategy risks making the questioner appear ignorant, but the risk may be smaller than anticipated, for it is not unusual for the evidence of even widely accepted conclusions to be weak. Reasonable people have grounds for their beliefs, but often the evidence can be discovered only by questioning.

Fallacy of insufficient or inappropriate evidence

Sometimes evidence presented for a conclusion does not seem to bear directly on the conclusion but, nevertheless, is related to it. For example, an advertisement directed at dentists for a children's toothpaste states that the abrasiveness of the product has been reduced. Included in the advertisement is a figure showing that children's dentin is softer than adult dentin, implying that use of the toothpaste will reduce abrasion on children's teeth and that dentists should recommend it to their pediatric patients. The data are correct; children's dentin is softer than adults' dentin, so the issue is not the truth of the data; rather, the problem is how the data relate to the implied conclusion. In my view, to make the conclusion that children need the specially formulated toothpaste sound, three additional pieces of evidence are needed:

1. Evidence that erosion caused by abrasive toothpastes is a significant problem; that is, there are children who required clinical treatment or who suffered pain because of erosion caused by standard toothpastes. Establishing this premise would require an epidemiologic study. However, some would argue that just the chance of reducing erosion on children's teeth is sufficient reason to recommend a less-abrasive toothpaste.
2. Evidence that use of the children's toothpaste actually reduces erosion. This might be demonstrated most convincingly by a clinical study in which users exhibit less abrasion than users of a standard toothpaste. However, it might be argued that a laboratory study could also demonstrate that the special formulation causes less erosion of children's teeth. Extracted primary teeth could be assessed for erosion after being brushed with the new toothpaste or a standard brand for specified periods of time, using an apparatus that applied a known pressure on the brush.
3. Evidence that the new toothpaste is as effective as standard toothpastes in plaque removal and reduction in caries, because these are the principal reasons for toothpaste use.

This simple example illustrates that what constitutes relevant or sufficient evidence can depend, as noted in points 1 and 2, on individual judgment. On other issues, such as point 3, there probably would be widespread agreement. Nevertheless, accepting any conclusion and implementing action involves applying appropriate standards. Setting standards requires judgment, which in turn requires justification and is open to criticism. If too high a standard is set, it will seem that nothing can be known with certainty. For example, although the philosopher Descartes concluded that certainty is the impossibility of doubt, Fisher4 has argued that Cartesian skepticism is an inappropriate standard in normal circumstances. Excessively high standards for accepting conclusions lead to inaction; aging professors sometimes suffer from masterly inactivity, for if nothing can be established with certainty, why bother doing research? On the other hand, setting low standards for accepting conclusions leads to the tolerance of error and a condition of gullibility or naiveté.

The Assertability Question

In The Logic of Real Arguments, Fisher5 advocates the assessment of conclusions by the use of the assertability question (AQ): What argument or evidence would justify the assertion of the conclusion? Or, expressed slightly differently, what evidence would I require before I believed the conclusion? The AQ can be an efficient tool for analyzing scientific articles, because it leads the questioner to focus on the critical issue of whether the data presented as evidence for a conclusion are relevant and sufficient. The answer to this question can often be found by scanning a paper's abstract (or summary) and identifying the main conclusions. If the data in the paper are identified as relevant by the AQ, then the paper may deserve a closer look; otherwise, it may be a better use of time to look at other articles. For example, my answer to the AQ for the conclusion of the children's toothpaste advertisement comprises the three points given earlier. When these points are not answered, I do not feel compelled to look more closely at other aspects of the advertisement (such as exactly how hardness of dentin was measured or the taste of the toothpaste). The AQ also forces the questioners to consider explicitly the appropriateness of the standards of evidence that they employed.6 In some fields of research, the standards of evidence may relate to the particular techniques used. In biochemical research, the standards for declaring a preparation of a particular protein pure have undergone continuous revision upwards as new methods of analysis (which reveal contamination) or preparation (which lower the possibility of contamination) are developed.

Types of Logic
Several types of logic have been developed to deal with the various types of problems to be addressed, the evidence that is available, and the relative certainty required in the conclusions. All of these are used in scientific research. As will be seen later, because of these varying standards, arguments that are accepted in one type of logic can be classed as fallacious in another.

• Deductive logic is the tool used when attempting to link assumptions or evidence to a conclusion so that acceptance of these elements leads to acceptance of the conclusion. Conclusions established in this way are held to be absolutely certain. The most widely used forms of deductive logic are categorical syllogisms and implication based on conditional propositions.
• Inductive logic, in contrast, is not certain. It operates by examining specific instances or multiple occurrences and arriving, through some form of insight, at a more general principle, which can then be considered a hypothesis and tested by further observations. Such reasoning entails a certain leap of faith and thus cannot establish conclusions with certainty. For example, one could observe the color of swans in Canada and inductively arrive at the conclusion that all swans are white. But that is an uncertain inference, and further testing, in Australia for example, would find the conclusion to be erroneous. In modern times, inductive conclusions are often expressed in terms of mathematical probability, such as the near-ubiquitous "P < 0.05" found in many papers in dentistry and medicine. Important branches of inductive logic include argument by analogy, which will be covered in chapter 7, and cause-effect reasoning, which will be covered in chapter 8. Analogy forms a key component in forming research hypotheses, and cause-effect reasoning is at the heart of debates on public health measures and regulations.
• Abduction, to be discussed more fully in chapter 7, is defined as reasoning that accepts a conclusion on the grounds that it explains the evidence; it is used when the best available evidence is weak. The conclusion of an abductive argument is that the truth of the conclusion is at least possible. In some instances, it can be considered a kind of logic of discovery (ie, it provides a starting point for future investigation).


• Informal or semiformal logic. Readers of traditional texts on logic sometimes struggle to apply their learning to arguments or discussions in the real world, where terms are used quickly and fluidly. In such arguments, opponents can find themselves "fighting feathers": positions change rapidly, the rigorous rules of formal logic are difficult to apply, and a more common-sense approach is required. A number of philosophers have therefore advocated more informal modes of logic. Perhaps the earliest and most renowned pioneer of informal logic was Stephen Toulmin, who developed models of scientific explanation and argumentation6 that will be utilized in chapter 22. The Canadian philosopher Douglas Walton has developed a theory of informal logic7,8 based on the concept of a question-reply dialogue between two participants, each representing one side of an argument. As noted earlier, such a dialogue is the interaction that occurs among referees and authors of submitted papers. Walton's approach is essentially pragmatic. It differs from inductive logic, which he sees as based on probability, in that informal logic or pragmatics is based on plausibility, defined as "a matter of whether a statement appears to be true in a normal type of situation that is familiar."8 The structure of argumentation schemes is similar to that of deductive logic in that it leads from premises to a conclusion, but the standard of proof differs. The conclusion of an informal logic argument is not certain; rather, it yields a type of presumption that Walton defines as "a qualified, tentative, assumption of a proposition as true that can be justified on a practical basis, provided that there is no sufficient evidence that the proposition is false,"8 a statement that essentially expresses the concept of "defeasibility." Defeating such presumptions is the basic goal of an author in responding to referees' criticisms of conclusions in a submitted article. An author might well hope that referees will offer praise, but the author cannot retain any hope for publication if he or she agrees with a referee who demonstrates to a reasonable standard of proof that the conclusion is false. The strategic value of establishing a presumption is that it shifts the burden of proof to the other party; it thus requires a response from the opponent. So when a referee establishes a presumption that the conclusion of a submission is not accurate, the author is obliged to respond. Walton has developed systematic criteria for identifying some 96 recurrent argumentation schemes. Furthermore, for each argumentation scheme,

Walton proposes a set of “critical questions” that should be asked to attack the argument. Conversely, proponents might use the critical questions to test and design their arguments. Walton’s critical questions will be used in chapter 8 on cause-effect reasoning.

An Introduction to Deductive Logic
A valid deductive argument is one in which the premises provide conclusive grounds for the truth of the conclusion. Deductive logic is valuable because it establishes new relationships between different terms. Although most people consider themselves logical, a test of university students' ability to solve simple problems in deductive logic found that, on average, they scored only 20%!9 If deductive logic is to be helpful in evaluating experimental research, it must be studied, because most people do not acquire it naturally. Because certain errors and weaknesses in logic appear more frequently than others, learning the most common errors makes them easier to spot. This section intends not only to summarize the rules of fundamental deductive logic but also to indicate effective approaches to critiquing scientific presentations on the basis of logic. It is by no means a complete presentation; more detailed information on logic can be found in sources devoted to the topic.10–14 Deductive logic is widely used in law, where in routine legal cases the law is applied in a deductive fashion, as when a statutory rule is applied to a particular set of facts to render a straightforward result.15 Formal deductive arguments are most frequently encountered in the discussion section of scientific papers. Scientists use deductive logic when they believe that they have established certain facts and wish to relate their findings to other information. The use of deductive logic in science, however, has a checkered history. Ancient Greek as well as medieval philosophers were more concerned with what should be than with what actually was and made no distinction among science, philosophy, and religion. Such speculative philosophers relied on a maximum of reasoning and hypotheses and a minimum of observation, particularly observations that verified the truth of the deductions. Observation was connected to practical work, which was believed to be inferior to contemplation. The modern term for this approach is armchair science. In the medieval world, it was thought that truth


Table 6-1 | Categorical statements

Code | Quantity   | Quality     | Quantifier | Subject       | Copula  | Predicate
A    | Universal  | Affirmative | All        | Canadians (D) | are     | brave people (U)
E    | Universal  | Negative    | No         | Canadians (D) | are     | brave people (D)
I    | Particular | Affirmative | Some       | Canadians (U) | are     | brave people (U)
O    | Particular | Negative    | Some       | Canadians (U) | are not | brave people (D)

(D = distributed; U = undistributed)

about the physical world could be arrived at by the exercise of reason upon a few observed facts. Deductions, although valid, were sometimes based on false (or speculative) premises and thus were not sound. Consequently, deductive logic fell into disrepute as the primary means of scientific inquiry.16 The use of deductive logic is most productive when the axioms are well established, as in Mendelian genetics, but it also can work well in less-established schemes. Watson and Crick proposed that, on the basis of x-ray diffraction patterns, DNA comprised two strands linked by hydrogen bonds.17 Others deduced that this model could be tested by exposing DNA to conditions that rupture hydrogen bonds and seeing if single strands of DNA could be obtained. Such deduction requires expert knowledge. In this example, the scientists who deduced the consequences of the Watson-Crick model were familiar with physical chemistry. Similarly, some dental research also tests the deduced consequences of hypotheses. However, there is not, in many instances, a firm knowledge of underlying mechanisms, and the tests are not as rigorous as would be desired. A second use of deductive logic in scientific articles occurs in the discussion section. Although philosophers might argue that inductive logic used to establish some premises is never certain, scientists often will treat the results of their research as established facts. They combine these newly discovered facts with previously reported data to arrive at new conclusions or hypotheses. In some instances, it is the new relationship between the facts that is discussed, and the logic used to arrive at the conclusions is deductive.

Syllogisms
Aristotle postulated that arguments could be cast into a basic pattern called the syllogism. A categorical syllogism consists of two premises and a conclusion. An example of a valid syllogism in standard form is the following: All scientists are dishonest; Einstein was a scientist; therefore, Einstein was dishonest. This is a valid syllogism: if the premises are true, then the conclusion is true. The conclusion, however, would not be sound if either premise were false. An example of an invalid syllogism follows: Some men are philosophers; some men are Greeks; therefore, some Greeks are philosophers. Although the conclusion is probable, the argument is invalid, because the truth of the premises does not guarantee the truth of the conclusion. As the latter example shows, the validity of a syllogism is not necessarily obvious. Testing can nevertheless be done rigorously and relatively simply, as will be discussed; there are, however, some problems. The first is identifying the structure of the argument, which requires recognizing that an appeal to reason rather than observation is being made. Second, the reader must determine which parts of the argument are premises and which are conclusions. The premises often can be identified by words such as since, for, on account of, and because. Conclusions are often preceded by indicators such as therefore, thus, hence, so, it follows that, consequently, and we may infer that. Authors can either state the premises before giving the conclusion or start from the conclusion and work backward. Imprecise terms pose another problem in analyzing arguments. Words can have emotional or evaluative overtones and take on different meanings, depending on the context. For example, the word or can be used


[Fig 6-1 The square of opposition: A (All Canadians are brave) and E (No Canadians are brave) across the top are contraries; I (Some Canadians are brave) and O (Some Canadians are not brave) across the bottom are subcontraries; the diagonals A–O and E–I are contradictories; the vertical A–I and E–O relations are subalternation. (Reprinted with permission from Copi.9)]

in an exclusive sense, as in the phrase “take it or leave it,” in which or means either one, not both. Or is used in a weak or inclusive sense in the sentence, “If it is snowing or raining the game will be canceled.” Presumably the game would be canceled if a mixture of snow and rain was falling. Or in this inclusive sense means either; possibly both. In symbolic logic, the symbol for the inclusive sense of or is v. For an equivalent meaning to or in the exclusive sense, the symbol v must be used with two additional symbols in a particular combination. Although this sounds complex, it results in clarity. Unfortunately, ambiguity exists when arguments are stated in everyday language. Scientific arguments are made in everyday, although somewhat stilted, language and are not framed in the precise but abstract terms of the symbolic logician. A compromise that helps clear up ambiguity, but still enables the analysis to be conducted in everyday language, is translating the arguments into a standard form to obtain statements in the form of precise propositions.
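The behavior of the two senses of or can be made concrete with a few lines of code. The following Python sketch is purely illustrative (the variable names are invented for the snow/rain example above):

# Inclusive "or" (logical disjunction) versus exclusive "or" in Python.
for snowing in (True, False):
    for raining in (True, False):
        inclusive = snowing or raining    # true if either component is true, possibly both
        exclusive = snowing != raining    # true if exactly one component is true
        print(snowing, raining, "inclusive:", inclusive, "exclusive:", exclusive)

The two senses disagree only in the case where both components are true: the inclusive or yields true, while the exclusive or yields false, which is precisely the ambiguity the symbolic notation is designed to remove.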

Categorical statements
The traditional four statements and their properties and representations are given in Table 6-1. The A and I statements are affirmative in quality, while the E and O (from nego, meaning I deny) are negative in quality. The quantity of a statement is universal if it refers to all members of a class and particular if it refers only to some. A subject (and/or predicate) is distributed (indicated by the letter D in Table 6-1) if it refers to the whole class it names. Obviously, the subjects of A and

E statements are distributed. A trickier problem is the distribution of the predicate terms. The predicate of the A statement (all Canadians are brave people) is not distributed (marked U in Table 6-1 for undistributed). Not all brave people are Canadians. The predicate of an E statement (no Canadians are brave people) is distributed, because, in asserting that the whole class of brave people is excluded from the whole class of Canadians, this statement asserts that each and every brave person is not a Canadian. Another convention is that a statement referring to a specified individual (eg, Socrates is mortal) is treated as a universal statement, because it refers to the whole of the subject (in this case, all of Socrates). Many categorical statements lack the standard form shown in our examples. The statement, “Some enzymes denature at high temperatures,” does not contain a form of the verb to be, but it can be altered with no change in meaning to “Some enzymes are heat-labile substances.” The altered statement has the standard form. Some common but nonstandard quantities easily can be translated to standard form. Terms such as every, any, everything, anything, everyone, and anyone usually can be interpreted as all. A, an, and the can mean either all or some, depending on the context. Phrases using only or none but generally imply that the predicate applies exclusively to the subject. These can be altered with no change in meaning to A statements (eg, “Only proteins are enzymes” becomes “All enzymes are proteins.”) The relationships between categorical statements are summed up in the traditional square of opposition (Fig 6-1). Two statements are contradictory if one is the denial


of the other. They cannot both be true, and they cannot both be false. Statements that differ in both quantity and quality are contradictory (eg, A contradicts O; E contradicts I). Two propositions are said to be contraries if they cannot both be true, although they might both be false. Two propositions are subcontraries if they both cannot be false though they might both be true. Any arguments that depend on numeric or probabilistic information for their validity are asyllogistic and cannot be analyzed by the methods of deductive logic.
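These relations can be checked mechanically. The Python sketch below is my own illustration, not part of the traditional treatment; it assumes, as traditional logic does, that the subject class is not empty:

# The four relations of the traditional square of opposition.
def square_relations(a, e, i, o):
    """a, e, i, o are the truth values of the A, E, I, and O statements."""
    return {
        "A and O are contradictories": a != o,
        "E and I are contradictories": e != i,
        "A and E are contraries (not both true)": not (a and e),
        "I and O are subcontraries (not both false)": i or o,
        "subalternation: A implies I": (not a) or i,
        "subalternation: E implies O": (not e) or o,
    }

# Example: "All Canadians are brave" is false but "Some Canadians are brave"
# is true; the relations then force O to be true and E to be false.
for relation, holds in square_relations(a=False, e=False, i=True, o=True).items():
    print(relation, holds)

Every relation prints True for this assignment, as the square requires.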

Categorical syllogisms
A syllogism is a deductive argument in which a conclusion is derived from two premises. In a valid syllogism, if both premises are true, the conclusion must be true. Categorical syllogisms contain three categorical statements, which among them use three terms, each appearing twice. An example of a valid syllogism is the following:

premise 1: All dentists are crooks.
premise 2: Doug is a dentist.
conclusion: Therefore, Doug is a crook.

The major term (crook) appears as the predicate of the conclusion. The minor term (Doug) appears as the subject of the conclusion. The middle term (dentist) appears in both premises but not in the conclusion. While this syllogism is valid, the conclusion is unsound, because one of the premises (all dentists are crooks) is false.

Rules for valid categorical syllogisms
The syllogisms about Greeks and philosophers and about dentists and crooks have similar forms, but one is valid and the other is invalid. An argument is invalid when its conclusion is not justified by its premises; it is still possible for the conclusion to be found factual in subsequent investigation. For an argument to be valid, it must follow four rules.

1. A valid categorical syllogism contains just three terms, which must be used in the same way throughout the argument. If a term is used in different senses, the fallacy of equivocation is committed. This fallacy is particularly easy to commit if a term is used imprecisely. Consider the syllogism:

premise 1: The causative agent of periodontal disease is plaque.
premise 2: Plaque is the causative agent of dental caries.
conclusion: The causative agent of periodontal disease is the causative agent of dental caries.

This syllogism has a valid form. However, if the plaque mentioned in the first premise is subgingival plaque while the plaque specified in the second premise is supragingival plaque, the syllogism is invalid. Rule 1 is violated, because there are really four terms, not three. The middle term, plaque, has been used in two different senses.

2. In a valid standard-form categorical syllogism, the middle term must be distributed in at least one premise. This is the rule most often violated in everyday reasoning. In the example of an invalid syllogism involving Greeks and philosophers, the middle term, men, is not distributed. Despite being invalid, such conclusions may have a high probability.

3. A term that is distributed in the conclusion must be distributed in the premises. Frequently, this rule is used when evaluating arguments that contain premises starting with the word some. (a) If either premise starts with some, then the conclusion must also start with some. (b) No deductively valid conclusion can be made when both premises start with the word some. But other arguments that do not contain premises starting with some can also commit this error, as shown below:

premise 1: All Canadians are brave people.
premise 2: No Americans are Canadians.
conclusion: Therefore, no Americans are brave people.

The term brave people is distributed in the conclusion but not in the premises.

4. (a) If there is a negative premise, there must be a negative conclusion. (b) If there are two negative premises, the argument is automatically invalid. This rule is particularly important when interpreting negative results, for to show that a thing lacks one property does not necessarily prove that it has another.

As an example of testing a categorical syllogism, consider the argument discussed by Iversen18 on chalones. It is not acceptable logic to say: Chalones cause proliferation delay and thus tumor regression; cancer cure is always associated with tumor regression; ergo, chalones cure cancer.


First, we can identify whether this is a logical argument, for Iversen states that it is not acceptable logic. The general form of the argument looks like a categorical syllogism, but we must translate it to standard form. The first premise is complex, as there is a causal link between proliferation delay and tumor regression as well as between chalones and tumor regression. However, inspection of the premises and the conclusion indicates that the three principal terms are chalone, tumor regression, and cancer cure. For the moment, we can ignore proliferation delay and concentrate on tumor regression. The argument recast in standard form becomes the following:

premise 1: All chalones are tumor regressors.
premise 2: All cancer cures are tumor regressors.
conclusion: Therefore, all chalones are cancer cures.

In applying rule 1, the syllogism passes, because there are just three terms. However, rule 2 dictates that the middle term, tumor regressors, be distributed in at least one premise. The term tumor regressors is not distributed in either premise, so the syllogism is invalid. Iversen is correct when he states that it is not acceptable logic.

The fallacy of existential assumption
The chalone example illustrates another problem of scientific logic: reification, or misplaced concreteness. The concept of chalones is that of tissue-specific growth inhibitors, but, unlike various other growth regulators, substances identified as chalones have never been purified; you cannot order a bottle of chalones from Sigma Chemical (which sells almost every known biochemical). Thus, in a practical sense, chalones do not exist, and the argument commits the fallacy of existential assumption. However, the general concept of growth inhibitors exists and can be used in literature searches.
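Rules 2 through 4 are mechanical enough to be programmed. The following Python sketch is my own illustration (the encoding of statements as form-subject-predicate triples is invented for the purpose); applied to the chalone argument, it flags the undistributed middle term identified above:

# Checking the distribution rules for categorical syllogisms.
DIST_SUBJECT = {"A", "E"}     # forms whose subject term is distributed
DIST_PREDICATE = {"E", "O"}   # forms whose predicate term is distributed
NEGATIVE = {"E", "O"}

def distributed(form, term, subject, predicate):
    if term == subject:
        return form in DIST_SUBJECT
    if term == predicate:
        return form in DIST_PREDICATE
    return False

def check_syllogism(major, minor, conclusion):
    """Each statement is a (form, subject, predicate) triple; form is A, E, I, or O."""
    problems = []
    premises = [major, minor]
    c_form, c_subj, c_pred = conclusion
    # The middle term appears in both premises but not in the conclusion.
    middle = {t for _, s, p in premises for t in (s, p)} - {c_subj, c_pred}
    if len(middle) != 1:
        return ["not a standard three-term syllogism"]
    m = middle.pop()
    # Rule 2: the middle term must be distributed in at least one premise.
    if not any(distributed(f, m, s, p) for f, s, p in premises):
        problems.append("undistributed middle term")
    # Rule 3: a term distributed in the conclusion must be distributed in a premise.
    for term in (c_subj, c_pred):
        if distributed(c_form, term, c_subj, c_pred) and \
           not any(distributed(f, term, s, p) for f, s, p in premises):
            problems.append("term '%s' distributed in conclusion only" % term)
    # Rule 4: negative premises and a negative conclusion must go together.
    negatives = sum(f in NEGATIVE for f, _, _ in premises)
    if negatives == 2:
        problems.append("two negative premises")
    elif (negatives == 1) != (c_form in NEGATIVE):
        problems.append("negative premise/conclusion mismatch")
    return problems or ["no rule violated"]

# Iversen's chalone argument in standard form:
print(check_syllogism(
    major=("A", "chalones", "tumor regressors"),
    minor=("A", "cancer cures", "tumor regressors"),
    conclusion=("A", "chalones", "cancer cures"),
))  # -> ['undistributed middle term']

Note that rule 1 is assumed rather than checked: terms used in two senses, like the two kinds of plaque, must still be caught by careful reading of the argument.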

The mixed hypothetical syllogism
Conditional propositions (also known as implications) take the form: If P then Q (P is the antecedent, and Q is the consequent). There are two valid forms of the mixed hypothetical syllogism that employ conditional propositions. The first, in the affirmative mood, is called modus ponens:

premise 1: If P then Q
premise 2: P
conclusion: Q (valid)

Example:
premise 1: If it is an enzyme, then it is a protein.
premise 2: It is an enzyme.
conclusion: It is a protein.

Notice that the second premise affirms the antecedent of the conditional proposition, not the consequent. If the consequent is affirmed, then the argument is invalid (the fallacy of affirming the consequent; see chapter 8). The other valid mixed hypothetical syllogism is in the negative form. Called modus tollens, it is given below:

premise 1: If P then Q
premise 2: Not Q
conclusion: Not P (valid)

Example:
premise 1: If it is an enzyme, then it is a protein.
premise 2: It is not a protein.
conclusion: It is not an enzyme.

The fallacy associated with modus tollens is the fallacy of denying the antecedent: If it is an enzyme, then it is a protein; it is not an enzyme; therefore, it is not a protein (invalid).

The pure hypothetical syllogism
A pure hypothetical syllogism has the following structure:

premise 1: If P then Q
premise 2: If Q then R
conclusion: If P then R

Example:
premise 1: If it is an enzyme, then it is a protein.
premise 2: If it is a protein, then it contains amino acids.
conclusion: If it is an enzyme, then it contains amino acids (valid).
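Because these forms are truth-functional, their validity can be verified exhaustively: an argument form is valid when no assignment of truth values makes all premises true and the conclusion false. A minimal Python sketch of such a check (the function names are invented) follows:

# Exhaustive truth-table check of argument forms in P and Q.
from itertools import product

def implies(p, q):
    return (not p) or q

def valid(premises, conclusion):
    # Valid: every row making all premises true also makes the conclusion true.
    return all(
        conclusion(p, q)
        for p, q in product((True, False), repeat=2)
        if all(premise(p, q) for premise in premises)
    )

print(valid([implies, lambda p, q: p], lambda p, q: q))          # modus ponens: True
print(valid([implies, lambda p, q: not q], lambda p, q: not p))  # modus tollens: True
print(valid([implies, lambda p, q: q], lambda p, q: p))          # affirming the consequent: False
print(valid([implies, lambda p, q: not p], lambda p, q: not q))  # denying the antecedent: False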


The disjunctive (alternative) syllogism
A valid disjunctive syllogism takes the form:

premise 1: Either P or Q
premise 2: Not Q
conclusion: P

Example:
premise 1: Either Jones is a boy, or Jones is a girl (a disjunctive premise).
premise 2: Jones is not a girl (a categorical premise).
conclusion: Therefore, Jones is a boy.

The disjunctive (either/or) proposition states that at least one (and possibly both) of its components is true. The disjunctive syllogism is valid only where the categorical premise (eg, Jones is not a girl) contradicts one part of the disjunctive premise and the conclusion affirms the other part. The disjunctive syllogism is widely used in scientific reasoning. For example, should several plausible hypotheses be tested and all but one found not to be true, the refuted hypotheses would be eliminated, and the remaining hypothesis would be accepted. In statistical hypothesis testing, it is assumed that there are only two hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis is tested so that it can be rejected and the alternative hypothesis accepted. The major fallacy associated with the use of the disjunctive syllogism is the UFO fallacy (see chapter 9), in which the list of possible alternatives is not exhausted by the premises, as could be argued of the boy/girl dichotomy in the Jones example.
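The elimination pattern, and its vulnerability when the disjunction is not exhaustive, can be sketched in a few lines (the hypothesis names are invented):

# Hypothesis elimination as a disjunctive syllogism.
hypotheses = {"H1", "H2", "H3"}        # premise: the alternatives, assumed exhaustive
refuted = {"H1", "H3"}                 # alternatives ruled out by testing
remaining = hypotheses - refuted
if len(remaining) == 1:
    print("Accept:", remaining.pop())  # sound only if no live alternative was omitted

The code accepts H2 automatically; whether that acceptance is justified depends entirely on the unprogrammable premise that the original list was exhaustive.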

Chains
Arguments can be linked together so that the conclusion of one syllogism is a premise of a subsequent syllogism. Clearly, the chain is only as strong as its weakest link.

Suppressed premises
Some arguments contain only two statements: a premise and a conclusion. For example: since all enzymes are proteins (premise), they are interesting to study (conclusion). To make this argument valid, a second premise must be true; namely, that all proteins are interesting to study.

Suppressed premises occur frequently in scientific papers, and in everyday reasoning, because many writers do not like to belabor obvious truths. However, a good strategy for hiding a weak point in an argument is to use a suppressed premise, simply because a weakness not stated directly can be difficult to detect. Suppressed premises can occur in scientific papers as additional information that is needed, but not supplied, to correctly interpret a statement. In most cases, the author believes that the information is so widely known there is no need to state it directly. For example, the most widely quoted paper in biomedical research is the method of Lowry et al19 for the determination of protein. The authors note that the amount of color obtained with the method varies with different proteins. The suppressed premise of the legions who have used the method is that the color-producing qualities of their protein sample are the same as those of the protein (normally bovine serum albumin) used in the calibration of a standard curve. In some situations (eg, in solutions containing collagen), it would be unlikely for this assumption to be true. A major problem with an argument containing a suspected suppressed premise is determining what the suppressed premise is. For the argument to be sound, the suppressed premise, together with the expressed premise, must yield a valid conclusion. But there may be several ways for the argument to be framed. Because everyday language lacks the precision of logical form, before accusing an author (or speaker) of any particular suppressed premise or error in logic, one should try various forms of statement (ie, categorical, hypothetical, and disjunctive syllogisms) to test whether there is more than one way of expressing the missing premise so that a valid argument still results.10
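To see where the suppressed premise enters the arithmetic of the Lowry example, consider the following sketch. The numbers are invented, and a real assay would fit a measured standard curve; the point is only that the final division silently assumes the sample colors like BSA:

# Reading a protein concentration off a BSA standard curve.
bsa_standards = [(0.2, 0.11), (0.4, 0.22), (0.8, 0.44)]  # (mg/mL, absorbance); invented values
# Least-squares slope of a line through the origin.
slope = (sum(c * a for c, a in bsa_standards) /
         sum(c * c for c, a in bsa_standards))
sample_absorbance = 0.30   # hypothetical collagen-containing sample
estimate = sample_absorbance / slope
print(f"Apparent protein concentration: {estimate:.2f} mg/mL")
# Suppressed premise: the sample yields as much color per mg as BSA. Collagen,
# being poor in the aromatic residues the reagent detects, yields less, so the
# true concentration would be underestimated; the arithmetic cannot reveal this.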

Occurrence and sources of error in deductive logic
Deductive logic is sufficiently complex that it is easy to make errors. When study participants were asked to evaluate deductive arguments, the following kinds of errors were distinguished20,21:

• Failure to accept the logical task. Many people fail to distinguish between a conclusion that is logically valid and one that is factually correct. These people evaluate the conclusion and not the logical form of the argument.
• Restatement of a premise or conclusion so that the intended meaning is changed. Many individuals


interpret propositions (such as all As are Bs) to mean that the converse is true (all Bs are As).
• Omission or addition of a premise.
• Misinterpretation of a statement of the form No As are Bs (an E statement) to mean that nothing has been proved.
• Probabilistic inference (ie, reasoning that things with common qualities or effects are likely to be the same) accounted for many errors. This error can occur when rule 2 for valid syllogisms (ie, the middle term must be distributed at least once) is violated.
• With implications, studies of participant understanding have shown that nearly everybody understands that P implies Q, but only 20% understand that Not-Q implies Not-P or, expressed differently, that denying the consequent disproves the antecedent.15

Most of these errors are errors in the validity of the logic, not the truth of the premises. A second group of fallacies argue from a false premise (see chapter 9). These arguments are so common and have persisted for so long that they may be viewed as traditional fallacies.

Disproof of deductive arguments
A counterexample disproves the validity of a logical argument when it shows that the premises can be true when the conclusion is not. When trying to disprove a conclusion, most individuals quickly search through their minds for counterexamples to refute invalid conclusions rather than refuting them by formal logical analysis.22 A conditional proposition can be attacked by (1) attacking the antecedent (ie, disproving P) or (2) attacking the implication by showing one case where P did not imply Q (ie, where P is true and Q is false).15
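For truth-functional forms, the search for counterexamples can itself be automated. A brute-force Python sketch of my own follows; it refutes the fallacy of denying the antecedent by exhibiting the one assignment that defeats it:

# Brute-force search for counterexamples to a two-variable argument form.
from itertools import product

def counterexamples(premises, conclusion):
    for p, q in product((True, False), repeat=2):
        if all(prem(p, q) for prem in premises) and not conclusion(p, q):
            yield (p, q)

# Denying the antecedent: If P then Q; not P; therefore not Q.
found = list(counterexamples(
    premises=[lambda p, q: (not p) or q, lambda p, q: not p],
    conclusion=lambda p, q: not q,
))
print(found)  # [(False, True)]: P false and Q true makes both premises true and the conclusion false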

The Value of a Formal Analysis of Arguments
Having worked through this chapter, the reader may be wondering whether it was worth the effort. Although truth tables and Venn diagrams figure highly in logic textbooks, they are not normally found in the Journal of Dental Research. Moreover, it might be argued, deductive logic will not tell us anything new; that is, it will not tell us anything that was not already contained in the premises. While true, these considerations overlook the value of the process. In analyzing an argument, a researcher must determine the premises and the conclusions and consider carefully how terms are used. Formal analysis of arguments enables the researcher to see if premises are missing and, by setting the premises apart, forces the researcher to consider whether they are plausible. Finally, the validity of the logic can be assessed in a straightforward manner to determine whether, when the premises are true, the conclusions are binding.

References
1. Medawar PB. Induction and Intuition in Scientific Thought. London: Methuen, 1969:48.
2. Salmon WC. Logic. Englewood Cliffs, NJ: Prentice Hall, 1963:1.
3. Adler MJ, Van Doren C. How to Read a Book. New York: Simon and Schuster, 1972:162.
4. Fisher A. The Logic of Real Arguments. Cambridge: Cambridge University Press, 1988:136–138.
5. Fisher A. The Logic of Real Arguments. Cambridge: Cambridge University Press, 1988:22–27.
6. Toulmin SE. The Uses of Argument. Cambridge: Cambridge University Press, 1958.
7. Walton D. Informal Logic: A Pragmatic Approach, ed 2. Cambridge: Cambridge University Press, 2008.
8. Walton D. Argumentation schemes. Argument diagramming. In: Walton D (ed). Fundamentals of Critical Argumentation. Cambridge: Cambridge University Press, 2005:84–171.
9. Chapman LJ, Chapman JP. Atmosphere effect re-examined. In: Wason PC, Johnson-Laird PN (eds). Thinking and Reasoning. Harmondsworth: Penguin, 1968:83–92.
10. Copi IM. Introduction to Logic, ed 4. New York: Macmillan, 1972.
11. Salmon WC. Logic. Englewood Cliffs, NJ: Prentice Hall, 1963.
12. Capaldi N. The Art of Deception. New York: Brown, 1971.
13. Beardsley MC. Writing with Reason: Logic for Composition. Englewood Cliffs, NJ: Prentice Hall, 1976.
14. Gilbert MA. How to Win an Argument. New York: McGraw-Hill, 1979:28–39.
15. Orsinger RR. The Role of Reasoning in Constructing a Persuasive Argument. http://www.orsinger.com/PDFFiles/constructing-a-persuasive-argument.pdf. Accessed 25 July 2018.
16. Fowler WS. The Development of Scientific Method. Oxford: Pergamon, 1967:17–36.
17. Watson JD, Crick FH. Molecular structure of nucleic acids. A structure for deoxyribose nucleic acid. Nature 1953;171:765.
18. Iversen OH. Comments on "chalones and cancer." Mech Ageing Dev 1980;12:211.
19. Lowry OH, Rosebrough NJ, Farr AL, Randall RJ. Protein measurement with the Folin phenol reagent. J Biol Chem 1951;193:265.
20. Henle M. On the relation between logic and thinking. Psychol Rev 1962;69:366.
21. Chapman LJ, Chapman JP. Atmosphere effect re-examined. J Exp Psychol 1959;58:220.
22. Johnson-Laird PN, Hasson U. Counterexamples in sentential reasoning. Mem Cognit 2003;31:1105.


7 Introduction to Abductive and Inductive Logic: Analogy, Models, and Authority



"Did you observe his knuckles? . . . Thick and horny in a way which is quite new in my experience. Always look at the hands first, Watson. Then cuffs, trouser knees, and boots. Very curious knuckles which can only be explained by the mode of progression observed by—" Holmes paused and suddenly clapped his hand to his forehead. "Oh, Watson, Watson, what a fool I have been! It seems incredible, and yet it must be true. All points in one direction. How could I miss seeing the connection of ideas? Those knuckles—how could I have passed those knuckles?"
SIR ARTHUR CONAN DOYLE1

The purpose of studying logic for the analysis of scientific papers is to evaluate the strength of the conclusions in relation to the evidence offered for their support. Deductive logic makes explicit the content of the premises and clarifies the relationship between the terms, but it does not yield anything new. Inductive and abductive arguments have conclusions whose content exceeds that of the premises and, for that reason, are never certain.

Abduction and Scientific Discovery
Abduction is defined as reasoning that accepts a conclusion on the grounds that it explains the evidence.2 The term was proposed by Charles Sanders Peirce, the originator of pragmatism, who described himself as a "laboratory professor" and whom some consider the greatest American philosopher. He used the word abduction to address a variety of issues, from the logic of discovery to the economics of research.2 In essence, abduction as a logic of discovery posits that when an unfamiliar natural phenomenon is observed, a scientific investigator typically hypothesizes an explanation out of all the theoretically possible explanations. In general, scientists make choices with some valid basis; otherwise, the task of formulating and testing hypotheses would be endless. Abduction can be considered a form of inference, like deduction and induction, but if the best available evidence is weak, it is a weak form of inference. Weinreb3 distinguishes abduction from deduction and induction as follows: The truth of the premises combined with valid or correct form makes deductive and inductive arguments certain or probable, respectively, whereas the truth of the premises of an abductive argument makes the truth of the conclusion possible.


In "The Adventure of the Creeping Man,"1 Sherlock Holmes observes that 61-year-old Professor Presbury has taken on apelike abilities and behaviors. Using abduction, Holmes reasons that the apelike behavior was produced by monkey-gland extracts, for Holmes had discovered that Presbury had obtained monkey-gland serum in an attempt to rejuvenate himself. The motive for his treatment was that Presbury had become engaged to Miss Morphy, a young woman perfect in both mind and body. Holmes notes that her father (Professor Morphy) had not objected to the engagement, as Presbury was rich. To modern readers, the conclusion that monkey-gland serum could produce apelike behavior is pure science fiction,4 but, in that era, gland extracts were viewed as a possible means of rejuvenation. Sherlock Holmes' name is popularly associated with deduction, but many of his "deductions" were actually instances of creative abduction.3,5,6 In such instances, his conclusions were by no means certain. Based on this story (in which one professor injects himself with mysterious serum to become virile and apelike, and another allows his daughter to be engaged to a rich man very much her senior), the reader might conclude through abduction that Doyle did not think highly of professors.

Induction
This section considers some common forms of inductive reasoning and some of the fallacies associated with them. In evaluating inductive logic, the term acceptable or correct is used in place of the term valid. Valid implies that the truth of the premises guarantees the truth of the conclusion, which is never the case for inductive logic. There are three rules that may be applied to test the acceptability of the conclusion of an inductive argument7:

1. The premises are true.
2. The argument has correct form.
3. The premises of the argument embody all available relevant evidence.

The role of additional evidence in inductive logic
In contrast with deductive arguments, which are self-contained, additional evidence is relevant to inductive arguments. Consider Doug, who is a dentist enrolled in a PhD program.

Argument A:
premise 1: 90% of dentists earn over $20,000 per year.
premise 2: Doug is a dentist.
inductive conclusion: Doug probably earns over $20,000 per year.

Argument B:
premise 1: 90% of graduate students earn less than $20,000 per year.
premise 2: Doug is a graduate student.
inductive conclusion: Doug probably earns less than $20,000 per year.

Arguments A and B both have correct inductive form and true premises, but they contradict one another. The difficulty is that neither A nor B uses all of the relevant evidence on Doug's income. Hence, we must weigh the available evidence, or recombine it into a more suitable form, so that all the information is used.

An important aspect of the role additional evidence plays in interpreting scientific papers is an aberration of logic that can be described as proof by selected instances.8 In this form of deceit, an argument is claimed to be supported by certain facts that are, indeed, true. But while an argument may contain nothing but facts, it does not necessarily contain all the facts. If facts that are damaging to the conclusion are omitted, then the inductive logic used to arrive at the conclusion is flawed. Proof by selected instances is used in clinical presentations in the form of proof by selected cases. Clinicians seldom present their failures, often preferring to emphasize their successes. In the absence of good experiment design, which includes such factors as the method of patient selection, elimination of bias, objective criteria, and appropriate comparison groups, such demonstrations of success can mean little. Moreover, there are common biases in information processing that impede even scientifically trained individuals. Physicians, for example, do not process the absence of cues as efficiently as they process the presence of cues.9 That is, they are more struck by data in which two markers are positively correlated and may not look for situations where one marker is present and the other is absent. The data presented in Table 7-1 are taken from a study involving 112 infants.10 At first glance, these data appear to support the hypothesis that children run a


Table 7-1 | Example of incomplete evidence

Child (< 2 years) with caries | Teeth (no.) | Decayed teeth (no.) | Infant S mutans score | Mother S mutans score
1 | 6  | 4  | High     | Moderate
2 | 6  | 4  | Moderate | High
3 | 4  | 2  | High     | High
4 | 10 | 8  | High     | High
5 | 12 | 2  | High     | High
6 | 16 | 3  | High     | High
7 | 16 | 13 | Low      | Moderate
8 | 12 | 2  | Low      | High

Data from Brown et al.10

greater risk of caries and infection with Streptococcus mutans (the bacterium associated with dental caries) when the mother is infected with S mutans. The missing information, however, is the scores of the other 104 children and their mothers. If there were high scores among the mothers whose children did not have caries, the hypothesis would be weakened. If this information were to be omitted, the authors would be telling only part of the story, and the conclusions would be suspect. (Brown et al10 did report on the S mutans scores of all 112 participants in another table.) Proof by selected instances is also practiced by biomedical scientists. As noted by Trinkhaus,11 micrographs taken from the ultrathin sections used for electron microscopy show great variability. In consequence, it is easy to find the expected and to disregard the rest. If microscopists publish only pictures that support their conclusions, they are guilty of proof by selected instances. Moreover, proof by selected instances can be thought of as the practical application of the law of cognitive response (see chapter 4), a rhetorical device that advises would-be persuaders to focus attention on data that support one's position. The requirement to use all available evidence explains the importance attached to the date of an article. As a general rule, a more recently published article is better. Authors writing in 1997 (when the first edition of this book was published) had less evidence available to them than authors writing in 2019. Thus, all other things being equal, the earlier authors are less likely to be correct. It is not necessarily illogical to change your mind. As available evidence changes, so must the conclusions. Indeed, perhaps the main reason that investigators or clinicians follow the

literature of their disciplines closely is so that they may use the best available information for their reasoning. Similarly, scientific investigators have the responsibility of searching widely in the literature so that the best available evidence can be considered. This duty is addressed explicitly in meta-analyses and systematic reviews, in which authors consider, record, and report all evidence used in their papers, even when doing so involves prodigious effort such as hand-searching indices or attempting to retrieve so-called grey literature (see chapter 5). Interpreting the data logically and objectively is one problem; using even well-interpreted data in decision making is another. The process of decision making is of critical importance to dentists, who propose treatments, as well as to their patients, who accept or reject their dentists' recommendations. Sadly, decision making, as commonly practiced, is not necessarily data-driven. The effect of emotions on decision making has been studied by eminent psychologists such as Gilbert.12 Humans, it appears, suffer from presentism, a cognitive bias that limits our ability to imagine ourselves, for example, as happy when we're sad, or as hungry after Thanksgiving dinner. Our choices do not reveal us as coldhearted, rational decision makers, for they are inevitably influenced by our present emotions.13 Even our moral choices have been found to be based on quick emotional decisions that we later rationalize by composing a narrative in support of the choice.14 Thus, it seems that for making optimal decisions, rational data interpretation must be supplemented by disciplined decision making, a process that will be covered in chapters 22 and 23.
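Returning to Table 7-1, the arithmetic of proof by selected instances can be illustrated with invented counts. Six of the eight mothers shown score high for S mutans (treating moderate scores as not high), but what that means depends on the 104 caries-free children who are not shown. The two "completions" below are hypothetical, not data from Brown et al10:

# Caries risk by maternal S mutans score under two hypothetical completions
# of the unreported 104 mother-child pairs (all counts invented).
def risk_by_mother(high_caries, high_free, low_caries, low_free):
    return (high_caries / (high_caries + high_free),   # risk when mother scores high
            low_caries / (low_caries + low_free))      # risk when mother does not

# Completion 1: high maternal scores are rare among the caries-free children.
print(risk_by_mother(high_caries=6, high_free=10, low_caries=2, low_free=94))
# -> about (0.38, 0.02): the association looks strong.

# Completion 2: high maternal scores are common among the caries-free children.
print(risk_by_mother(high_caries=6, high_free=80, low_caries=2, low_free=24))
# -> about (0.07, 0.08): the apparent association disappears.

The eight selected rows are identical in both scenarios; only the unshown data decide whether the hypothesis is supported.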


Analogy
Argument from analogy is based on a comparison of two apparently similar cases and can be a powerful tool of persuasion, because it can appeal to something that the audience is familiar with and sometimes feels strongly about. Analogy is often presented as an inductive argument, in that its conclusions are not certain and some arguments may be based on similarities among multiple cases.15 In contrast, Walton16 feels that many arguments from analogy are best evaluated as plausible arguments, because often the relevant issues in inductive argument, such as sample size or representativeness, are not addressed. In any case, analogy has the following form:

premise 1: A (the source) has the properties P1, P2, P3.
premise 2: B (the target) has properties P1, P2, P3.
premise 3: A (the source) also has property Px.
conclusion by analogy: B (the target) also has property Px.

An analogy becomes stronger as the two classes of objects become more similar. For example, it would be better to test drugs intended for human use on monkeys rather than on insects. In general, to whatever extent there are relevant similarities (ie, relevant to the point of the argument), the analogy is strengthened. Relevant differences weaken the argument.

Walton's critical questions on the use of analogy
As noted above, argument from analogy is not certain, for similarity is not identity. There is always some point at which the analogy will break down. This does not invalidate the argument form, because it is only important that the similarity holds in the respects that are relevant to the argument. Walton,16 however, notes that it can be difficult to discern when two arguments are plausibly similar and that arguments based on analogy may be slippery to either decisively confirm or refute. Because comparisons can be made in a multitude of ways, a dispute involving analogy characteristically remains open to further dispute (as occurs in appeal courts of law). A powerful analogy can shift the burden of proof from one side to the other, but likewise pointing out a relevant dissimilarity can shift the burden back again. Walton has identified three general critical questions on the use of analogy:

1. Is the comparison between the two situations plausible and right?
2. Is the analogy faulty because the two situations are not similar in a relevant respect?
3. Can a counteranalogy be constructed?

Walton uses the example of the issue of whether the United States should supply Contra insurgents in Nicaragua with arms. The proponents argued that the Contras were similar to the patriots in the United States War of Independence and thus merited support, a position that would appeal to many patriotic Americans. Their opponents believed that a more relevant similarity was that, in providing arms to the Contras, the United States was immersing itself in a situation not unlike the Vietnam War, which was powerfully repellent to the many citizens who had lived through those turbulent times.

Analogy and the law
Although it may appear that analogy is a weak method of inference, it forms the basis of a considerable portion of our legal decisions. Orsinger notes that in difficult cases, where the legal rule to be applied is uncertain or where it is unclear how the rule applies to the facts, lawyers and judges shift to reasoning by analogy.17 A judicial precedent has been defined by a former Lord Chancellor as "a judgment or decision of a court of law cited as an authority for deciding a similar state of facts in the same manner, or on the same principle by analogy."18 Because legal judgments are generally regarded to be reasonable decisions made by reasonable people, we can conclude that analogy is a useful form of argument. However, like all inductive arguments, analogies must be evaluated in the presence of all available information. It is thus clear why the rules of evidence are so important to the legal process. If, for instance, relevant information were suppressed, the reasoning process used to arrive at a decision would be unacceptable. Despite the widespread use of analogy in law, there is an active debate among legal theorists on the soundness of arguments based on analogy, and attempts have been made to transform analogies into frameworks of deduction.3

Analogy and scientific discovery
Scientific theories have been constructed by analogy. An example of this is the Bohr theory of the atom, originally modeled on the solar system. Murphy19 has stated that the soundness of an analogy depends on


how faithfully the terms of the discussion are translated into symbols, and how faithfully the conclusion from the manipulation of symbols is translated back. In the Bohr model, representing the electron as an entity with a discrete locale did not adequately explain some of its properties, and, although useful, the Bohr model has been superseded. Analogy is frequently used in proposing models to guide investigations, perhaps the most cited example being the Bohr model of the atom, mentioned earlier. As noted by Bell and Staines,20 analogies are easy to find but difficult to justify. A proponent must show in detail how the phenomena are alike and that the similarity does not derive from an arbitrary choice of a descriptive term. Like any explanation of phenomena, the analogy should make some testable predictions based on independent evidence; in other words, evidence that was not taken into consideration in forming the analogy. Scientists use analogies frequently in problem solving (3 to 15 analogies in a 1-hour lab meeting), though often these analogies are based on superficial features in common.21 Moreover, scientists use structural analogies when formulating hypotheses, such as comparing gene structure and function between species.15 In a paper on treatment of temporomandibular joint disorders, Lous22 argued as follows:

Joint stretching is a method of treatment frequently used in physical medicine. In 1955 Bang and Sury described a new principle in the treatment of intervertebral disc lesions. A traction splint . . . was used, and extension of the vertebral column could be demonstrated radiographically. About 50% of the patient group experienced relief of pain and increased range of movement. . . . A similar effect on the temporomandibular joint (TMJ) can be obtained by using spring mechanisms placed between the maxilla and the mandible.

This is a clear case of proposing an experiment based on reasoning by analogy. The analogy suggests that what works for intervertebral disc lesions will work for the TMJ.

Models
Reasoning by analogy in experimental science often occurs through the use of models. In principle, a model represents and demonstrates the fundamental principles of the system of interest. Mice carrying the lymphoproliferation (lpr) mutation demonstrate many similarities to people afflicted with systemic lupus erythematosus. It would be reasonable to test the effects of drugs or various immunotherapies on the affected mice, but the fruitfulness of such an approach would depend on how closely the disease of mice resembled that of humans. Other models can be more abstract. Physiologists model systems by writing equations for well-established processes that might be occurring in a system, performing the necessary calculations, and comparing the calculated results to the observed results. A close correspondence between the observations and calculations suggests, but certainly does not prove, that the processes are operating. All models are simplifications of reality, but the degree of simplicity varies. A conceptual model may be only a basic approximation of what processes might be operative. Physical models are widely used in science; the force distribution of implant designs has been studied by embedding the implants in plastic and using polarized light to observe stress patterns. Ideally, as in the physical sciences and engineering, modeling can comprise writing equations for well-established processes from prior theoretical knowledge. Models can often provide an explanation for a vast range of observed facts; in some instances, the explanatory scheme is so realistic that it can be called a mechanism. Additionally, modeling is sometimes employed empirically to represent cause-effect relationships in experimental data. Where there is a lack of understanding of the actual mechanisms or processes involved, a common approach is the black-box model, which does not rely on any intuitive interpretation in terms of the actual processes occurring in the black box; it merely relates input to output. Often, relating input to output is done mathematically by the use of transfer functions, such as those devised by Laplace and Fourier.23 Such models are valuable, because they enable scientists to make accurate predictions in the absence of detailed understanding. There are sophisticated statistical means to test hypotheses and validate quantitative models. For example, in periodontics there has been much discussion of whether attachment-level changes occur in bursts or gradually. The decision of which model is operative is important, because it has implications for how a clinician should treat the disease. There are generally two criteria for model selection: (1) hypothesis testing and (2) cross-validation.23–25 In the hypothesis-testing approach, the model with the greater prevalence is put forward as the null hypothesis, and other models are tested as alternative hypotheses. The advantage of this method


is that there is only a small chance of accepting a more complex or less prevalent model; the disadvantage is that it favors the prevalent models. In the cross-validation approach, all models are considered equal, and the best model is the one that produces the minimum least-squares error when each data point is excluded in turn and predicted by the model fitted to the remaining data. In a study using the cross-validation technique, Yang et al24 found that neither the burst model nor the gradual-loss model was a good predictor of change in attachment level of patients with moderate to severe periodontal disease. Yang et al24 also used the black-box approach, where the operator inside the black box was an autoregressive time-series model; that is, the past behavior of the site was used to predict its future. Sadly, they found that a model that fits past data well cannot be accurately extended into the future. Thus, their findings illustrate the principle familiar to economists that "he who lives by the crystal ball will often have to eat shattered glass." A problem with models is that particular assumptions underlying a model's use may not be valid. In the periodontics black-box study, as in many economic forecasts, the assumption was that the processes that operated in the past would continue to operate in the future. Such an assumption may well be valid for brief periods but could be unreliable over extended periods. Another issue with models is that they have limits; that is, their predictions hold over a limited range of possibilities. The example often used in this context is Newtonian mechanics, which is perfectly able to explain and predict macroscale phenomena, such as the movement of billiard balls after a collision on a table, but is not similarly useful for describing the behavior of particles at the subatomic level, now the domain of quantum mechanics. In general, in science there is an expectation that investigators will test the validity or applicability of the assumptions of their models, although sometimes that is very hard to do. The Higgs boson, for example, was used as a theoretical construct to perform calculations for some 50 years before physical evidence of its existence was provided by experiments at CERN (the European Organization for Nuclear Research).13 The experiments required a 27-kilometer-long tunnel near the border of France and Switzerland, extremely sophisticated instrumentation, and a small army of collaborators from institutions scattered around the world.
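The cross-validation criterion described above can be sketched with a toy example. Everything below is invented, the two models are deliberately crude, and no clinical conclusion should be drawn; the sketch shows only the mechanics of scoring models on sequentially excluded points:

# Leave-one-out cross-validation for two toy models of attachment level over time.
def fit_line(points):
    # Least-squares straight line y = a + b*t ("gradual loss" model).
    n = len(points)
    st = sum(t for t, y in points)
    sy = sum(y for t, y in points)
    stt = sum(t * t for t, y in points)
    sty = sum(t * y for t, y in points)
    b = (n * sty - st * sy) / (n * stt - st * st)
    a = (sy - b * st) / n
    return lambda t: a + b * t

def fit_step(points):
    # Crude "burst" model: one mean before and one after the midpoint of the record.
    mid = sum(t for t, y in points) / len(points)
    early = [y for t, y in points if t <= mid] or [0.0]
    late = [y for t, y in points if t > mid] or [0.0]
    return lambda t: sum(early) / len(early) if t <= mid else sum(late) / len(late)

def loocv_error(points, fitter):
    # Exclude each point in turn, fit to the rest, and score the prediction.
    err = 0.0
    for i, (t, y) in enumerate(points):
        model = fitter(points[:i] + points[i + 1:])
        err += (model(t) - y) ** 2
    return err

data = [(0, 5.0), (1, 5.1), (2, 5.0), (3, 6.2), (4, 6.3), (5, 6.2)]  # invented attachment levels (mm)
print("gradual (line) model:", round(loocv_error(data, fit_line), 2))
print("burst (step) model:", round(loocv_error(data, fit_step), 2))

The model with the smaller held-out error is preferred under this criterion, with no model privileged in advance, which is the sense in which all models are "considered equal."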

Argument from authority
In the discussion section of scientific papers, it is not unusual to see a statement such as "Jones concluded that . . .", usually an attempt to support a conclusion by virtue of the reliability of the person (or institution) making the statement. This is not necessarily good proof. Lord Kelvin, although widely acknowledged as a great scientist in his own time, pronounced shortly after their discovery that x-rays were an elaborate hoax,26 illustrating that in science the best appeal is not to authority but to observation. Indeed, the ultimate test of any statement must be observation. However, an appeal to authority is not necessarily wrong simply because it is not necessarily sound. Reasonable people often come to reasonable conclusions. If an authority has based judgment on objective evidence that could be examined and verified by any competent person, then such an authority is considered reliable. The format for the legitimate use of the argument from authority is given below:

premise 1: A is a reliable authority on subject X (that is, the vast majority of statements made by A on subject X are true).
premise 2: S is a statement made by A on subject X.
conclusion: S is probably true.

The most common ways that the argument from authority is misused are listed below27:

• The authority may be misquoted or misinterpreted, or a statement may be used out of context. For this reason, when an appeal to authority is made, the source should be documented.
• The authority may have no special competence in the subject under discussion. This misuse of the appeal to authority often occurs in advertisements. Mickey Mantle, who was a great athlete, used to promote treatments for athlete's foot, but, as far as I know, he had no special knowledge of mycology. In 1970, Linus Pauling, a two-time Nobel Prize winner, strongly advocated the use of vitamin C to cure the common cold. Many people believed that such an eminent scientist must be right. By 1979, however, numerous tests had demonstrated that vitamin C does not significantly affect the course of the common cold.28 Pauling was an expert, but an expert in chemistry, not nutrition or pharmacology.

• Authorities may express opinions about matters that have no available evidence. This is not legitimate, for one attribute of a reliable authority is that such an authority is evaluating objective evidence.

• Authorities can disagree. If scientists quote only the authorities with whom they agree, they bias the evidence, and the appeal to authority is not legitimate because it is not using all available information. It is difficult to pick up this type of misuse unless a reader knows the literature well. In legal cases involving expert witnesses, a good strategy is for a lawyer to employ an expert who holds an opposing view to the expert employed by the opponent. If authorities disagree, the appeal to authority is weakened. A reliable authority is one who holds views that are representative of her field; if there is controversy, it is proper to cite the arguments for both sides of the issue.

• An authority may have an axe to grind. Chapter 4, on rhetoric, illustrated that communicator credibility is enhanced by appearing to argue against self-interest, but the converse is also true. It is reasonable to be skeptical if an authority stands to gain something by having a view accepted. A common tactic for a lawyer questioning an opponent's expert witness is to inquire whether the expert is being paid, as this self-interest will reduce credibility. However, some axes are less visible than others. For example, the Finance Minister of Canada, a very wealthy man, recently proposed tax changes to more efficiently tax the rich. It became known, however, that he himself used a variety of loopholes in the tax system, including the use of numbered companies and indirect control of firms likely to benefit from the tax changes, to dodge rules intended to make conflicts of interest apparent.29 Similarly, in the academic world, it can be difficult to know how experts can benefit by having their advice implemented. Visibility is low on confidential peer-review grants committees and editorial boards, and these expert-laden groups exert enormous influence and may attempt to suppress, via the peer-review process, the publication and dissemination of findings that question or refute their own. That obfuscation tactic was thought by some to have occurred in the Climategate scandal.

• The authority must be current. I once knew a very engaging and articulate dentist who made a decent living as an expert witness. His reputation was based on a paper he had published some 40 years earlier in a prestigious journal. Armed with this paper and an academic position, he was able to convince judges of his expertise, yet his knowledge (and presumably his testimony) could not be authoritative, as he had little knowledge of current developments in his field. The book Eat to Win by Haas achieved considerable popularity, a fact that is not surprising, as most people like to eat and like to win. The book's popularity was further enhanced when tennis star Martina Navratilova credited her success to Haas's program. Haas's credentials, however, included a PhD from Columbia Pacific University, an unaccredited institution that offers nonresident doctoral degrees in 1 year or less by giving credit for "life, work, and all learning experiences."30 Knowledgeable reviewers of Haas's work do not recommend the book, as it contains many basic errors and misconceptions.31 Other books on health or nutrition depend for their credibility on the MD after the author's name. Remember the old joke: Q: If the person who graduates at the bottom of the West Point class is called the goat, what do they call the person who graduates at the bottom of the medical class? A: Doctor.

Walton32 incorporates considerations of these issues in six critical questions:

1. Expertise Question: How credible is the expert (E) as an expert source? What do E's qualifications include?
2. Field Question: Is E an expert in the domain (D) in which the proposition at issue (A) resides? This question can be answered by testimony and evaluation of other experts supporting E's status.
3. Opinion Question: What did E assert that implies A?
4. Trustworthiness Question: Is E personally reliable? What is E's record of prediction or accomplishments in this field (ie, D) of expertise?
5. Consistency Question: Is A consistent with what other experts assert in publications or other forms of reviewed evidence?
6. Backup Evidence Question: Is E's assertion based on evidence?

Despite these possible problems with the use of authority, scientists regularly defer to the knowledge of experts. Biochemists, for example, may rely on the findings of crystallographers, even though they do not understand in detail the principles and practices of that discipline.


A problem with the spread of a multidisciplinary approach is that no single author of a paper may be able to justify intellectually the entire contents of an article—a desideratum, if not a requirement, for authorship. A Nobel laureate embroiled in a case involving a colleague's fraud lamented, "one has to trust one's collaborators"33; such trust may be required, because the expertise to assess the data may not be present in all of the authors of a paper. Argument by authority, then, can appear to be a sort of necessary evil in the complex world of modern science.

Implications

The conclusions of scientific research can be regarded as provisional explanations or hypotheses. It is worth discussing the logic of implication (hypothetical inference) in detail, because formal hypotheses play a central role in descriptions of scientific method. Moreover, the most common means of criticizing scientific papers is by formulating alternative explanations of the data. Understanding the structure of the underlying reasoning shows why this approach is so effective. An excellent discussion of this topic can be found in Hempel,34 on which much of this section is based.

For the purposes of logical analysis, the testing of hypotheses can be expressed using conditional statements. Conditional statements have the form: Let there be a hypothesis H, which implies a certain event E. If H then E, where H is called the antecedent and E the consequent.

premise 1: If H then E.
premise 2: E is true.
conclusion: It becomes more credible that H is true.

Example:

premise 1: If Emil killed the cat, then it has stopped breathing.
premise 2: The cat has stopped breathing.
conclusion: It becomes more credible that Emil killed the cat.

The fact that the cat has stopped breathing makes the statement "Emil killed the cat" possible, and hence more credible, than if the cat were alive. But other explanations are possible: the animal could have died of natural causes, some other person could have killed it, etc. In this reasoning, note that the consequent (the cat has stopped breathing) is affirmed and is used to infer the truth of the antecedent. By deductive reasoning, this argument is fallacious, because the conclusion is not necessarily true even when the premises are true. This type of argument is sufficiently common that logic textbooks have given it a name: the fallacy of affirming the consequent.

You can reduce the uncertainty of the inductive reasoning process by gathering more information. It is particularly effective if the hypothesis H implies other events. Example:

event 1: If Emil killed the cat, then he will look guilty.
event 2: If Emil killed the cat, then he will have cat fur on his hands.

If each of these events in turn is found to be true, then the hypothesis becomes progressively more credible, although an absolute magnitude of its credibility is not determined.

The example of Emil and the cat was chosen to illustrate the uncertainty of this form of inductive reasoning, but often it is difficult to see the uncertainty of the logic because of the high probability of the statements. Consider the following:

premise 1: If this is a fair die, the ace will appear one-sixth of the time.
premise 2: The ace has appeared one-sixth of the time.
conclusion: It becomes more credible that this die is a fair die.

On the surface, this appears to be a legitimate argument, but a statistician would not be satisfied. An expert would want to know how many times the die had been thrown, for if premise 2 were based on just six throws of the die, the evidence that it was fair (or for that matter unfair) would be weak. The argument could then be attacked by constructing an alternative argument that would also fit the information: "If this die is loaded, then it is not unlikely that (if only a few tosses of the die were made) the ace might turn up one-sixth of the time." This example illustrates the role of specialized knowledge in the criticism of scientific work; it enables alternative explanations to be suggested for a given phenomenon.
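The statistician's point can be made numerically. The following sketch (my own invented illustration, not from the text) compares the likelihood of observing "the ace one-sixth of the time" under a fair die and under a hypothetical loaded die: with 6 throws the two hypotheses are nearly indistinguishable, while with 600 throws the fair die is overwhelmingly favored.

```python
# Likelihood of "ace one-sixth of the time" under two competing hypotheses.
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k aces in n throws when P(ace) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

for n in (6, 600):
    k = n // 6                               # ace appears one-sixth of the time
    fair = binom_pmf(k, n, 1 / 6)            # hypothesis: fair die
    loaded = binom_pmf(k, n, 1 / 3)          # alternative: ace-favoring loaded die
    print(f"n={n}: P(data|fair)={fair:.3g}, P(data|loaded)={loaded:.3g}, "
          f"likelihood ratio={fair / loaded:.3g}")
# With 6 throws the ratio is only ~1.5 (almost no discrimination); with 600
# throws it is astronomically large, so the number of throws is decisive.
```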


Disproving hypotheses

Although we cannot prove that a hypothesis is true, we can prove rigorously that it is false. Consider the following:

premise 1: If H is true, then so is E.
premise 2: Not E.
conclusion: H is not true.

This is a deductively valid argument called modus tollens, in which the conclusion is sound if the premises are true (see chapter 5). Hence, it is possible to disprove a hypothesis rigorously. To return to our example:

premise 1: If Emil killed the cat, then it has stopped breathing.
premise 2: The cat has not stopped breathing.
conclusion: Emil has not killed the cat.

Although the disproof of a hypothesis can be logically rigorous, another problem remains. When you state that an event (E) has not occurred, you can only do this within the limits of detection that are set by the observational conditions. That is, the statement Not E is susceptible to two interpretations: (1) E did not occur and the conclusion is sound, or (2) E occurred but could not be detected by the observational conditions used, in which case the conclusion is not sound. Investigations in dental science sometimes report that there was no difference between treated patients and control subjects. But an observed result of no difference between treated and control groups could mean many different things, including the following:

• There really is no difference.
• There is a difference, but the measurement techniques used were too insensitive to detect the difference.
• There was no difference in the particular patients used in the study, but selection of a different type of patient may yield a different result.
• The treatment was not applied properly.

The list could continue. Because failure to observe something can be explained in many ways, journals are often reluctant to publish papers based entirely on negative findings.

Auxiliary hypotheses

People do not like being wrong. When the available evidence indicates that their favored hypothesis is wrong, scientists often devise ingenious explanations to save their hypothesis. Such explanations may be possible, because the hypothesis was not tested directly but was tested by a method that required another hypothesis to be true as well. Suppose that a researcher had a belief that recurrent caries were more common under composite fillings than under amalgam. To test this hypothesis, the researcher examined the radiographs of patients who had received the two types of filling but found no difference. At first glance, it appears that the hypothesis must be rejected; however, there is a way out, which can be seen by putting the arguments into a standard format:

premise 1: If both H and H-1 are true, then so is A.
premise 2: A is not true.
conclusion: H and H-1 are both not true.

Example:

premise 1: If there is an effect of restorative material on recurrent caries (H) and all recurrent caries can be detected on radiographs (H-1), then we will see more caries under some materials.
premise 2: No effect was seen; ie, there was no statistical difference in the number of caries found under different types of restoration.
conclusion: It is not possible that there is both an effect of restorative material and that all recurrent caries are detected on radiographs.

It still can be maintained that there is a difference between the materials (ie, H is true), because it is possible that not all recurrent caries are detected on radiographs (that is, H-1 is false). H-1 (in this case, the hypothesis that your measuring technique is valid) is called an auxiliary hypothesis. The initial hypothesis can be defended by denying the truth of the auxiliary hypothesis. Thus, by adding auxiliary hypotheses, a favorite hypothesis can be protected from rejection. However, the overuse of this method weakens the credibility of the original hypothesis. Hypotheses of this kind, called ad hoc hypotheses, are particularly weak if they are arranged only to explain particular results and lead to no additional test implications.

Evaluation of scientific hypotheses

As is evident from the above, the logic of implication is driven by hypotheses. A useful hypothesis for scientific investigations must have the following characteristics34:

1. Internally consistent. The hypothesis should not contradict itself on any specific question. Consider the problems of long-distance romances as analyzed by folk wisdom. On the one hand, it is said that "absence doth the heart make fonder," but, on the other hand, it is said that "out of sight is out of mind." Folk wisdom is not internally consistent; a scientific hypothesis, however, must not contradict itself.

2. Comprehensive. The hypothesis should explain all known relevant facts. A hypothesis is supported more strongly if a variety of tests or data agree with the predictions. Support is strengthened by variety in the tests because the more diverse the possibilities covered, the greater the chance that the hypothesis would have been falsified were it false.

3. Testable. The experiments or observations that test the hypothesis must be feasible. I attended a lecture in 1972 where an economist, after careful consideration of rates of oil use and available supply, predicted the world would run out of oil in 1980. The economist, of course, was wrong, but his hypothesis was good in the sense that it could be tested. Either there would be oil and gas available in 1980 or there would not. For science to progress, a useful hypothesis must do more than simply restate the data. It must extend our understanding by predicting results under conditions that have not yet been observed. If the conditions cannot be realized, then the hypothesis is useless. The logician Popper,35 who is credited with developing our modern conception of scientific method, has argued, "Simple statements, if knowledge is our object, are to be prized more than less simple ones because their empirical content is greater and because they are testable." Indeed, Popper's comparison of the works of Freud and Einstein led him to his important concept that falsifiability of a hypothesis was the key element in scientific method. Freud's theories were sufficiently flexible that they could be adjusted to explain everything; there was no way of testing them. But Einstein's theory of relativity made some startling and precise predictions that could have been (but were not) disproved.36

4. Simple. Attempts to explain the logical basis for this criterion have not been entirely successful, but it would be difficult to argue that a complex hypothesis is necessarily better than a simple one. The most widespread use of this criterion occurs in statistical evaluation of data, when a significance level (ie, the probability of the result having arisen entirely by chance) is calculated. If this level is higher than a stated level (usually 5%), then the result likely can be explained in terms of random events, and there is no need to postulate any other reason for differences between groups. The use of this criterion is traced back to the theologian William of Occam, who decreed that one should not multiply causes without reason—a principle known as Occam's razor. Unfortunately, Occam's razor is not infallible. Many biologic systems have turned out to be more complex than were originally envisaged. Nevertheless, the intuitive idea that simple explanations are usually better than complicated ones has received quantitative support from Bayesian statistical analysis (to be discussed briefly in chapter 21), which shows, in agreement with Popper, that a hypothesis with fewer adjustable parameters has an enhanced posterior probability, because its predictions are sharp37 (see the sketch after this list).

5. Novel. The point of scientific investigation is to add to knowledge; if the hypothesis does not do this, it is of limited use.

6. Successful predictions. Because the human mind is so ingenious in coming up with explanations for existing data, successful prediction is usually considered stronger support for a hypothesis than the explanation of a number of observations known when the hypothesis was put forward. A hypothesis that predicts a mathematical relationship between variables is particularly strong, because the precision of the fit of the observed data to the hypothesis can be calculated. This was evident to Isaac Newton, who, it is said,38 proposed a universe of precision to replace a universe of "more or less." It remains true today, particularly in hard sciences, such as physics or chemistry, where experiments are designed to test hypotheses framed in complex mathematical relationships.
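As a rough illustration of that Bayesian point (an invented coin-flip example in the spirit of Jefferys and Berger,37 not their own), compare the marginal likelihoods of a zero-parameter model (the coin is fair) and a one-parameter model (the bias p is uniform on [0, 1]): the simple model wins whenever the data match its sharp prediction.

```python
# The Occam factor in miniature: sharp predictions are rewarded when correct.
from math import comb

def evidence_fixed(k: int, n: int) -> float:
    """Marginal likelihood of k heads in n flips under M0 (p fixed at 0.5)."""
    return comb(n, k) * 0.5**n

def evidence_uniform(k: int, n: int) -> float:
    """Marginal likelihood under M1 (p uniform); the integral equals 1/(n+1)."""
    return 1.0 / (n + 1)

for k, n in ((6, 12), (11, 12)):
    m0, m1 = evidence_fixed(k, n), evidence_uniform(k, n)
    winner = "simple M0" if m0 > m1 else "flexible M1"
    print(f"{k}/{n} heads: M0={m0:.3f}, M1={m1:.3f} -> favors {winner}")
# Unremarkable data (6/12 heads) reward the simple model's sharp prediction;
# only strongly biased data (11/12) justify the extra adjustable parameter.
```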


Box 7-1 | Common objections to hypotheses17

General principle: Most objections extend the reasoner's current model of the situation under consideration.

1. Challenge to definitions and terms
a. Hypothesis: If A then B.
b. Objection: No, that depends on what you mean by A (or B).
c. Example: Debates and differing criteria on what constituted periodontal disease—1, 2, or 3 mm of loss of attachment or gain in pocket depth?

2. Different conclusion
a. Hypothesis: If A then B.
b. Objection: No, if A then not B but C instead.
c. Example: Hypothesis: Johnny hates girls: He puts their pigtails in the ink well.
d. Objection: No, Johnny likes girls: That's his way of getting their attention.

3. Different antecedent or cause
a. Hypothesis: A causes B.
b. Objection: No (not necessarily), C or D causes (or could cause) B.
c. Example: Hypothesis: Periodontal disease is caused by Treponema denticola.
d. Objection: No, periodontal disease is caused by a variety of gram-negative microbes.

4. Interference
a. Hypothesis: If A then B.
b. Objection: Normally perhaps, but because of C, B does not follow from A.
c. Example: Hypothesis: Plaque causes inflammation.
d. Objection: Normally it does, but in these immunosuppressed patients, it does not.

5. Irrelevant reasons
a. Hypothesis: A implies B.
b. Objection: No, A has little or no bearing on B.
c. Example: Hypothesis: Gingival recession is the result of aging.
d. Objection: No, gingival recession is not the result of aging per se but rather of the exposure of the tissues to plaque for long periods.

6. Too much/too little
a. Hypothesis: If A then B.
b. Objection: No, A is too much (too little) to bring about B.
c. Example: Hypothesis: Antifluoridationists believe fluoride in water supplies causes fluorosis.
d. Objection: Most dentists believe 1 ppm (common in water supplies) is too little fluoride to cause fluorosis.

7. Factor ignored
a. Hypothesis: If A then B.
b. Objection: No, the argument ignores factor C, therefore not B.
c. Example: Caries studies that ignore the age distribution of the subjects.

8. Counter example
a. Hypothesis: All As are Bs.
b. Objection: No, here is an A that is not a B.
c. Example: Hypothesis: All heavy plaque formers have periodontal disease.
d. Objection: Eastern Europeans often have lots of plaque but relatively little attachment or alveolar bone loss (apocryphal data).

9. Saving revision
a. Hypothesis: A does not imply B.
b. Objection: No, combined with (or taking into account) C, A does imply B.
c. Example: Hypothesis: Bleeding on probing does not predict attachment loss.
d. Objection: No, but if bleeding on probing occurs three out of four times in succession, then it does predict loss of attachment.

Criticizing hypotheses

A scientific hypothesis is most effectively criticized by proposing a more plausible alternative. As noted previously, this is an area where expert knowledge can be helpful or even critical. Although scientists have specialized expert knowledge, I think they often use the same strategies in criticizing scientific papers that they use in everyday reasoning. Perkins and coworkers39,40 found that 80% of objections to arguments in everyday reasoning fall into just eight categories. Box 7-1 presents the nine most common objections to hypotheses. The general principle is that most objections extend the reasoner's current model of the situation under consideration. This approach is well-suited to the criticism of scientific papers by experts on the topic, because such experts might be aware of some factor or consideration that is unknown to the author. I believe that individuals tend to use only a few common objections; for example, some people focus their critiques almost exclusively on definitions of terms. The list is given here so that readers can expand on the techniques they use to generate alternative hypotheses.


Summary of the Logic of Criticism of Inductive Arguments

The conclusions of many scientific papers boil down to either proposing a hypothesis that explains the data or stating that no effect was observed. Some articles contain a conclusion that can be considered a hypothesis. We know from the logic of inductive arguments that the hypothesis can be considered as only more or less credible, depending on the strength of the supporting data. Such hypotheses can be criticized by either (a) assessing the data, including techniques, methods of observation, data analysis, or other components of the study, or (b) analyzing the logic by formulating a plausible alternative hypothesis that explains the data better than, or at least as well as, the hypothesis put forward by the author.

Other articles deny the truth of some particular statement. I call these negative results papers. There are various ways that such denials can be made. A paper might state that some treatment did not have any effect. Because the authors are denying the hypothesis, the logic (modus tollens) is valid. Thus, the most effective strategy for criticizing such papers is to question the data. In particular, we often want to determine if the methods used in the study were sensitive enough to conclude reasonably that some event or effect did not occur.
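One quantitative way to press that last question, not developed in the text, is a statistical power calculation: given the sample size, how likely was the study to detect an effect of clinically meaningful size? A minimal sketch, assuming a two-arm design, an invented sample size, and the statsmodels library:

```python
# Hypothetical numbers; assumes the statsmodels library is installed.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power of a two-arm study (n = 15 per group) to detect a moderate
# standardized difference (Cohen's d = 0.5) at alpha = .05, two-sided:
power = analysis.power(effect_size=0.5, nobs1=15, alpha=0.05, ratio=1.0)
print(f"power = {power:.2f}")  # ~0.26: such a negative result says little

# Per-group sample size needed to reach 80% power for the same effect:
n_needed = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print(f"n per group for 80% power = {n_needed:.0f}")  # ~64
```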

References

1. Doyle AC. The Adventure of the Creeping Man. https://sherlock-holmes.classic-literature.co.uk/the-adventure-of-the-creeping-man. Accessed 30 July 2018.
2. Honderich T (ed). The Oxford Companion to Philosophy. New York: Oxford University, 1995.
3. Weinreb LL. Legal Reason: The Use of Analogy in Legal Argument. New York: Cambridge University, 2005.
4. Van Liere EJ. The physiological Doctor Watson. Physiol 1958;1:53–57.
5. Harrowitz N. The body of the detective model: Charles S. Peirce and Edgar Allan Poe. In: Eco U, Sebok TA (eds). The Sign of Three: Dupin, Holmes, and Peirce. Bloomington, IN: Indiana University, 1988:179–197.
6. Eco U. Horns, hooves, insteps: Some hypotheses on three types of abduction. In: Eco U, Sebok TA (eds). The Sign of Three: Dupin, Holmes, and Peirce. Bloomington, IN: Indiana University, 1988:198–230.
7. Salmon WC. Logic. Englewood Cliffs, NJ: Prentice Hall, 1963.
8. Thouless RH. Straight and Crooked Thinking. London: Pan, 1974.
9. Christensen-Szalanski JJ, Bushyhead JB. Physician's use of probabilistic information in a real clinical setting. J Exp Psychol Hum Percept Perform 1981;7:928–935.
10. Brown JP, Junner C, Liew V. A study of Streptococcus mutans levels in both infants with bottle caries and their mothers. Aust Dent J 1985;30:96–98.
11. Trinkhaus JP. Cells into Organs: The Forces that Shape the Embryo. Englewood Cliffs, NJ: Prentice Hall, 1984.
12. Gilbert DT. Stumbling on Happiness. New York: Vintage Books, 2007.
13. Hidalgo CA. Not quite rational man. https://www.city-journal.org/html/not-quite-rational-man-15130.html. Accessed 30 July 2018.
14. Haidt J. The Righteous Mind. London: Penguin Books, 2013.
15. Copi IM. Introduction to Logic, ed 14. Harlow: Pearson, 2014.
16. Walton D. Natural language argumentation. In: Walton D (ed). Informal Logic: A Pragmatic Approach, ed 2. New York: Cambridge University, 2008:289–332.
17. Orsinger RR. The Role of Reasoning in Constructing a Persuasive Argument. http://www.orsinger.com/PDFFiles/constructing-a-persuasive-argument.pdf. Accessed 31 July 2018.
18. Jowitt WA, Walsh C. Dictionary of English Law. London: Sweet & Maxwell, 1959.
19. Murphy EA. A Companion to Medical Statistics. Baltimore: Johns Hopkins University, 1985.
20. Bell PB, Staines PJ. Reasoning and Argument in Psychology. London: Routledge & Kegan Paul, 1979.
21. Dunbar K, Blanchette I. The in vivo/in vitro approach to cognition: The case of analogy. Trends Cogn Sci 2001;5:334–339.
22. Lous I. Treatment of TMJ syndrome by pivots. J Prosthet Dent 1978;40:179–182.
23. Hall CW. Errors in Experimentation. Champaign, IL: Matrix, 1977.
24. Yang MC, Marks RH, Clark WB, Magnusson I. Predictive power of various models for longitudinal attachment level change. J Clin Periodontol 1992;19:77–83.
25. Spilker B. Guide to Clinical Interpretation of Data. New York: Raven, 1986.
26. Kuhn TS. The Structure of Scientific Revolutions, ed 2. Chicago: University of Chicago, 1970.
27. Salmon WC. Logic. Englewood Cliffs, NJ: Prentice Hall, 1963:63–67.
28. Coulehan JL. Ascorbic acid and the common cold: Reviewing the evidence. Postgrad Med 1979;66:153–155.
29. McParland K. Bill Morneau discovers an elite, tax-dodging loophole exploiter: Himself. National Post. 18 Oct 2017.
30. Barr SI. Eat to win: An opinion. BC Runner 1985;2:3.
31. Koslowski B. Theory and Evidence: The Development of Scientific Reasoning. Cambridge: MIT, 1996.
32. Walton D. Appeals to authority. In: Walton D (ed). Informal Logic: A Pragmatic Approach, ed 2. Cambridge: Cambridge University, 2008:209–245.
33. Judson HF. The Great Betrayal: Fraud in Science, ed 1. Orlando: Harcourt, 2004.
34. Hempel CG. The test of a hypothesis: Its logic and its force. In: Philosophy of Natural Science. Englewood Cliffs, NJ: Prentice Hall, 1966:19–46.
35. Popper KR. The Logic of Scientific Discovery. London: Hutchinson, 1959.
36. Magee B. Philosophy and the Real World: An Introduction to Karl Popper. LaSalle, IL: Open Court, 1985.


37. Jefferys WH, Berger JO. Ockham's razor and Bayesian analysis. Am Sci 1992;80:64–72.
38. Westfall RS. Newton and the fudge factor. Science 1973;179:751–758.
39. Perkins DN, Allen R, Hafner J. Difficulties in everyday reasoning. In: Maxwell W (ed). Thinking: The Frontier Expands. Philadelphia: Franklin Institute, 1983:177–189.

40. Nickerson RS, Perkins DN, Smith EE. Failures to elaborate a model of a situation. In: Nickerson RS, Perkins DN, Smith EE (eds). The Teaching of Thinking. Hillsdale: Erlbaum, 1985:136–140.


8 Causation

". . . the principle of identifying causes of events by looking for factors that covary with the events will occasionally lead one to identify artifacts rather than genuine causes and to overlook genuinely causal correlations because the relevant covariates have not been noticed."
BARBARA KOSLOWSKI1

Hypotheses of Cause and Effect

Perhaps the most influential of hypotheses—and the most debated ones—are hypotheses concerning cause and effect. Such hypotheses are influential because they can hold the promise of altering human behavior or the environment and improving the human condition. Causal arguments often have practical consequences. For example, there is continued debate on the role of human activity in climate change and the measures proposed to alleviate it. In Canada, for example, we face the imposition of a carbon tax that will affect taxpayers' pocketbooks, and given Canada's relatively small population, many would argue that the minor changes induced by the tax would have no appreciable effect on the global problem of climate change.

The observation that cause-effect hypotheses are often controversial stems not only from their potential importance but also from the complexity of determining causal relations in the face of incomplete, even conflicting data, in which many potential factors may be causing or influencing the outcome. The determination by individuals of which factors are pivotal and which are just noise can be a selective and biased process, based on the individual's experience, which fills in gaps in the arguments. Trankall2 suggested that causal interpretations are inevitably based on a personal interpretation of the data in which a logical completion mechanism fills in causal gaps based on patterns of earlier experience. This process can be seen in the explanation of elections or predicted effects of policies by commentators of the left or right, who see the world through different prisms and come to different conclusions.

Part of lived experience is the books we read, and the books we read tend to reinforce our existing viewpoints. An analysis of book sales on Amazon, commissioned by The Economist3 and performed by the data scientist V. Krebs using the recommendation function ("Customers who bought . . . also bought . . ."), found that as a general rule people who buy conservative books buy only conservative books. If you were a liberal, you might think "typical right-wing closed-minded bias." However, it was also found that the same pattern was true of those on the left; as a general rule, those who buy liberal books buy only liberal books. Relatively few books are read by both liberals and conservatives. It is thus no surprise that, having different information and differing interpretations, liberal and conservative parties propose conflicting policies and have difficulties coming to agreement.

Common sense

One solution to such conflicts is that people should use "common sense." Indeed, the philosopher and scientist Bronowski4 has described science as organized common sense, but defining that property precisely is difficult. The Oxford Companion to Philosophy notes that common sense defies definition, and no one has succeeded in giving it.5 In that respect it resembles obscenity, where the learned Judge Stewart pronounced on hardcore pornography: "I shall not today attempt to further define the kind of materials I understand to be embraced by that shorthand definition, and perhaps I could never succeed in doing so. But I know it when I see it."6 But eventually decisions have to be made, and procedures and rules evolve to make them.

In everyday life, perhaps the most important use of commonsense reasoning occurs in judging cause-effect relationships. The belief underlying investigative approaches to identifying causes might be called the commonsense view of causality, which rests on the faith of the regularity or uniformity of nature. It has two tenets:

1. An identical cause will always produce an identical effect; thus, experiments can be repeated.
2. There is a reason for any change; events do not occur at the whim of the gods. For example, the appearance of disease must be caused by some change in either the internal milieu of the body or the environment.

Investigation of causality

A problem in establishing causality is that the proponents advocating opposing positions all have axes to grind. The question then arises of how we objectively determine that X causes Y in the first place. Many approaches using a wide variety of methods have been employed to find out the causes, or at least the conditions, that will bring about a certain effect. Indeed, that activity probably constitutes the main business of science and engineering. Francis Bacon is credited with the aphorism, "Knowledge is power." Moreover, he elaborated: "The end of our Foundation" (of a House of Solomon in his New Atlantis [1627]) "is the knowledge of causes and the secret motion of things, and the enlarging of the bounds of the Human Empire, to the effecting of all things possible."7 Development of scientific knowledge of causes thus has been viewed as a very practical affair.

An 18th-century philosopher's view: David Hume

The Scottish empirical philosopher David Hume denied that there could be any logical validity in our conception of cause and effect. Indeed, Hume's skepticism went further in that he held that belief in the external world and personal identity were not rational but instinctive. A lack of belief in the external world is not likely a view to be easily accepted and certainly not commonsensical. Hume was contradicted by his contemporary Scottish philosopher and champion of common sense, Thomas Reid, who held that common sense was based on certain innate properties of human nature.8 Hume did not answer Reid's criticisms, but rather disparagingly suggested that Reid avoid Scotticisms and improve his English.8 Nevertheless, Hume contributed to commonsense reasoning as he clearly defined why people commonly attach cause-effect relationships to a series of events.9,10

In his A Treatise of Human Nature, Hume listed eight principles often used to establish causes for effects. Hume's treatise even included such concepts as a dose-response relationship, using an example of the amount of heat and the response of pain or pleasure. Broadly speaking, Hume's principles can be condensed to four rules for justifying claims of cause-effect relationships: temporality, contiguity, repeated conjunction, and best available alternative. Study designs vary in their ability to support these four principles.

1. Temporality: The cause preceded the effect in time. It is common sense that for many phenomena, the longer the interval between the cause and the effect, the less the correlation between them, because the intervening time presents the opportunity for other events to exert an influence. Thus, sometimes the term temporal proximity is used as a criterion in place of the less specific cause precedes the effect. Study designs that rely principally on this criterion include case studies, case series, and cohort studies, as well as randomized controlled trials.

2. Contiguity: The cause and effect are contiguous in time and place; that is, a connection must exist between the two events that explains how they are connected causally. Animal models and laboratory research are the investigative designs that can contribute most strongly to arguments dealing with links between the cause and effect.

3. Repeated conjunction: There is a history of regularity in the relationship of the cause and the effect. Typically, this criterion is met by establishing a numeric association between the putative cause and the effect. Study designs based on this principle include ecologic studies, cross-sectional surveys, and case-control studies. This third condition is often employed; it foreshadowed the use of probability theory and associated measures by modern epidemiologists, who apply sophisticated statistical measures of association and regression to cause-effect relationships.

4. Elimination of alternatives: The fourth principle is generally attributed to John Stuart Mill.

A 19th-century philosopher's view: Mill's methods

John Stuart Mill developed a number of methods called canons of induction for the analysis of causes and effects in certain situations, including the method of agreement, the method of difference, the joint method of agreement and difference, the principle of concomitant variation, and the method of residues.11 In practice, the canons of induction boil down to one simple rule: Vary one factor at a time, and observe the result.11 Various means used in an attempt to achieve this ideal will be given in chapter 20, on experiment design. As noted earlier, hypotheses are generally criticized by proposing alternative hypotheses, and considering potential criticisms in advance would appear to be a sound rhetorical tactic.

Causal density

Manzi12,13 introduced the term causal density to distinguish the substantially different conditions for establishing causal relationships that pertain in different scientific spheres. Physics, for example, is blessed with the ability to develop universal laws through two happy circumstances: Predictive rules like the law of gravity (1) apply everywhere and (2) do not change with time. That is, there is uniformity and low causal density, so observations can be controlled. Similarly, biology, although having a much higher causal density (ie, more factors that might cause results to vary), exhibits what might be called a uniform biologic response, and experiments can be done through the use of randomization to balance competing causes among the groups being compared.

Manzi believes that the very high causal densities in the social sciences make it difficult to generalize results to other populations and circumstances. Manzi reports that of the 122 known criminology randomized field trials executed between 1957 and 2004 with at least 100 participants, only about one-fifth demonstrated positive results, and these failed to show consistent positive results. Moreover, of the 12 multisite randomized field trials in the sample, the only successful program showed a small effect that faded away with time. That one successful program, "broken windows" policing, appears to work but has been associated with controversial police practices. In general, Manzi believes that in place of the powerful, compact, and universal laws of physics, investigations in the social sciences have yielded extremely conditional statistical statements.

As thinking on causation developed, it soon became clear that the concept of "cause" was complex and that there were differing types of causes that applied to situations of varying causal densities.

Deterministic cause: Necessary versus sufficient conditions

In discussing cause-effect relationships, different types of causes and patterns of causation have been noted. A deterministic cause is one that inevitably produces the effect. For example, if the battery is dead, the car doesn't start.

A necessary condition is a condition that must be present to obtain an effect. One of the necessary conditions for dental caries is the presence of bacteria. If the necessary conditions of an event are known, it can be prevented from happening simply by removing one condition. To prevent caries, we attempt to remove bacteria.

A sufficient condition is a condition that automatically leads to another event. The difference between the necessary and the sufficient condition is that although a necessary condition must be present, it alone will not necessarily produce the effect; a sufficient condition is sufficient to produce the effect by itself. In some cases, a sufficient condition is really a set of necessary conditions, all of which must be present at the same time and place. Dental caries, for example, is the result of the interaction of diet and bacteria in a suitable host.

Insufficient but nonredundant part of an unnecessary but sufficient (INUS) condition is a concept that was developed by Mackie.14 Consider the formation of dental caries, which typically involves two specific types of bacteria, Streptococcus mutans and lactobacilli, as well as dietary sugars and a susceptible host. The presence of sucrose could be considered an INUS condition. By itself sucrose cannot produce caries, so it is insufficient. It is an unnecessary condition, because there are other sugars (eg, fructose) that can also support caries development. Nevertheless, when sucrose is combined with an appropriate set of other conditions (bacteria, susceptible host), the combined set is sufficient to cause caries, and removing sucrose from that set destroys its sufficiency, which is what makes sucrose nonredundant within the set. INUS conditions can explain failures to replicate experiments or clinical procedures. Technique-sensitive approaches to treatment in dentistry may be INUS conditions for a desired outcome in that they work in the hands of some dentists but not others because of variation in the dentists' technical approach. For example, it has been estimated that more than 37% of composite resin restorations are clinically insufficiently cured.15 However, light curing is a more complex process than first meets the eye, because successful curing involves the output power, spectrum, and tip design of the light; the exposure time; resin chemistry; the type of photoinitiator; the location and orientation of the restoration; the presence of any components that block light; and the clinician's ability to aim and maintain the light on target.16 Attainment of optimal curing means getting all these INUS conditions right. In that sense, many factors that we interpret as deterministic causes are in effect INUS conditions, because we rarely know all the factors that are required to produce an effect and thus may not note conditions that are frequently present.

A probabilistic cause is one that increases the probability of an effect. For example, smoking increases an individual's probability of developing lung cancer, but the effect is not inevitable. Cigarette smoking causes about 90% of lung cancers, but the majority of smokers never develop lung cancer. Hume,10 in fact, came to the conclusion that all causes were probabilistic causes if the sequence of events were examined in sufficient detail.
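Returning to the INUS caries example above, the logical structure can be made explicit in a small truth-functional sketch (my own illustration; the four-factor model is a deliberate simplification):

```python
# Toy model: caries requires a fermentable sugar AND bacteria AND a
# susceptible host. Each call below exercises one letter of "INUS".
def caries(sucrose: bool, fructose: bool, bacteria: bool, host: bool) -> bool:
    return (sucrose or fructose) and bacteria and host

print(caries(True, False, False, True))  # False: sucrose alone is Insufficient
print(caries(True, False, True, True))   # True: {sucrose, bacteria, host} is a Sufficient set
print(caries(False, False, True, True))  # False: dropping sucrose loses sufficiency -> Nonredundant
print(caries(False, True, True, True))   # True: fructose serves instead -> sucrose is Unnecessary
```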

Causal factors/contributory causes

Conditions that may be neither necessary nor sufficient conditions but nevertheless stand in some sort of causal relationship to the phenomenon are called causal factors or contributory causes. Consider the relationship between cigarette smoking and lung cancer. Both retrospective and prospective studies have demonstrated a statistical link between smoking and lung cancer. However, as noted by Stone17:

• Many people who smoke do not contract lung cancer; therefore, smoking is not a sufficient cause.
• Some people who do not smoke do contract lung cancer; therefore, smoking is not a necessary cause.

However, it remains possible that:

• Smoking is one of a group of causal factors that is sufficient for lung cancer.
• In some instances, smoking introduces a factor X that may be a necessary and/or sufficient cause.
• A hidden factor Y exists that may cause both smoking and lung cancer. One might argue that, although improbable, some factor, such as stress, may cause an immediate tendency to smoke cigarettes and a later one to develop cancer.

Practical Assessment of Causation

These commonsense concepts of Hume and Mill continue to be employed in everyday life, clinical practice, and scientific research. Every Canadian has probably engaged in the cause-effect thought processes entailed when their car doesn't start on a frigid winter's morning. Etched in their minds is a checklist of more or less probable causes, depending on the car:

• The battery is "dead" or operating suboptimally because at low temperature it cannot deliver sufficient energy to turn over the engine.
• The alternator that charges the battery is not functioning.
• The gas supply line is blocked, perhaps by ice formation.
• The gear selector lever is not positioned in "Park."
• The operator's foot is not on the brake.
• The starter motor is broken.
• Etc, etc.


Depending on one's experience, the most likely and easiest-to-test hypotheses, such as those involved in operating the car and outlined in the owner's manual, would be investigated first. Then, with increasing dread, the operator would proceed through the list to those causes with lower probability or higher cost to remedy, such as replacing the battery or the starter motor.

As an example of commonsense reasoning in dentistry, consider a dentist attempting to diagnose and treat a patient suffering from oral ulcers, which have many putative causes, such as aphthous stomatitis, infections, trauma, burns, poorly fitting dentures, or allergies. All these conditions could, in theory, meet Hume's second criterion, contiguity—a reason to connect the cause to the effect. Besides closely observing the oral tissues and the dentition, the dentist would typically want to know when the condition developed and how it progressed. For example, if the ulcer arose soon after the placement of new dentures, that possible cause might be considered a likely cause because it fulfills Hume's first criterion, temporality: The effect occurred after the putative cause.

Although such clinical reasoning seems simple, it is also powerful. Sackett et al18 stated that clinical reasoning is far more powerful than laboratory evaluation for most patients in most places. Various strategies are used to optimize this reasoning, such as the mnemonic VINDICATE (vascular, inflammatory, neoplastic, degenerative, idiopathic, congenital, autoimmune/allergic, traumatic, endocrine) for differential diagnoses to systematically cover various possibilities. Sackett et al18 described the approach adopted in this example as a hypothetico-deductive strategy, which involves "formulation, from the earliest clues about the patient of a short list of potential diagnoses and actions . . . followed by maneuvers that will best reduce the length of the list." Unlike scientists, who allegedly attempt to disprove their hypotheses, clinicians generally seek data that support their working hypothesis. If the dentist in this example produced a new set of dentures and the ulcer went away, the dentist could use the canons of induction to draw a conclusion: Because only one factor, the dentures, had been changed, the denture was the cause of the problem.

In scientific investigations, meeting criterion 1 (temporality) is one of the strengths of longitudinal cohort studies as well as clinical trials in which investigators introduce a putative cause and observe the result; cross-sectional surveys by themselves do not meet this criterion. The connections between cause and effect (contiguity) must be investigated to clarify the mechanisms of action linking the cause to the effect, as for example the action of radiation on DNA and the subsequent production of mutations and malignancy. The criterion of history of regularity (association) is documented by the use of statistical measures of association, to be discussed in more detail in subsequent chapters.

A Threat to Establishing Causation via Association: Confounding Variables

A confounding variable (confounder) is defined for epidemiologic purposes as a risk factor for the disease that is associated with the exposure (the variable of interest in a study) and thereby contributes to the observed association between the exposure and the disease. Smith and Philips19 concluded that it is likely that many of the associations identified in epidemiologic studies are due to confounding, often by factors that are difficult to measure. Confounding is often offered as an alternative explanation for an association between an exposure and a disease.

There are several criteria for determining a confounding variable: (1) there should be an association of the confounder and the disease among the unexposed individuals; (2) there should be a difference in the distribution of the confounding variable among the different exposure groups; and (3) the confounding variable should not lie on the causal pathway between the exposure and the disease.

Confounding results from imbalances in risk factors for the outcome (ie, disease) in different exposure groups. Confounding muddies the association of a risk factor with a disease and can either increase or diminish the apparent effect of the exposure variable. Confounding is often detected by comparing the crude and adjusted estimates of relative risk: If adjusting for a potential confounder changes the relative risk (to be discussed later), the variable is suspected to be a real confounder.
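A toy numeric sketch of that detection strategy follows, with invented data in which smoking confounds an exposure-disease association: the crude relative risk is well above 1, while the stratum-specific and Mantel-Haenszel adjusted relative risks are 1.

```python
# Each stratum: (exposed cases, exposed total, unexposed cases, unexposed total).
# Smoking is the confounder: it is associated with the exposure and is a
# risk factor for disease even among the unexposed.
strata = {
    "smokers":    (80, 100, 16, 20),
    "nonsmokers": (4,  20,  20, 100),
}

def relative_risk(a, n1, c, n0):
    return (a / n1) / (c / n0)

# Crude RR: collapse the strata, as if smoking had never been measured.
A  = sum(s[0] for s in strata.values()); N1 = sum(s[1] for s in strata.values())
C  = sum(s[2] for s in strata.values()); N0 = sum(s[3] for s in strata.values())
print(f"crude RR = {relative_risk(A, N1, C, N0):.2f}")          # ~2.33

# Mantel-Haenszel RR: a weighted combination of the stratum-specific RRs.
num = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in strata.values())
den = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in strata.values())
print(f"adjusted (Mantel-Haenszel) RR = {num / den:.2f}")       # ~1.00
# The crude RR shrinking toward 1 after adjustment suggests the crude
# association was produced by the confounder, not by the exposure itself.
```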

Adjustment

In epidemiologic studies, the possibility that confounding could account for observed associations is managed by demonstrating that the association is independent of the confounding factor.16 An exposure is said to be independently associated with the outcome if the association remains after the levels of other exposures are held fixed; this process is called controlling, adjusting, or conditioning for other exposures.

Patterns of causation

The concept of causal factors leads us to consider that there are different patterns of causation (discussed in greater detail in a complex and voluminous literature). Various patterns and definitions have been employed, but only the most common are discussed here. In practical terms, causality in epidemiology can be considered as a relationship between exposure and disease that will always be observed under stable conditions. However, the relationship may change if new intervening or interacting variables are introduced. Dentistry and medicine have an abundance of possible factors involving host, agents, and environment that could influence outcomes, and these complex interrelationships give rise to the epidemiologic concept of the web of causation.20 The web of causation has been described as a metaphor for the idea that causal pathways are complex and interconnected. A correspondingly large number of models have been developed to explain how causes can relate to effects. I have discussed the issue in greater detail in another publication.21

Fig 8-1 Direct. A single cause produces a single effect.
Fig 8-2 Pleiotropy. A single cause produces multiple effects.
Fig 8-3 Causal chain.
Fig 8-4 Conjunction/interaction. Two or more causes occur together to produce an effect.

Direct

A single cause produces a single effect (Fig 8-1). The single-headed arrow in the diagram indicates that the association between cause and effect is directed forward in time (the cause precedes the effect). In biology and medicine, direct causes are defined as those that act at the same level of organization (ie, cell, organ, or organism) as the effect that is being measured. Such causes are most readily identified when the time between the introduction of the cause and observance of the effect is short and the cause is deterministic. In the clearest instances, the effect occurs immediately. Treatment of dental abscesses by incision and draining would be considered a direct cause of pain relief from the perspectives of the dentist and the patient. However, a neurophysiologist might view the treatment as a causal chain involving several steps in detection, neural pathways, and processing.

Pleiotropy

When a single cause produces multiple effects, it is called pleiotropy (Fig 8-2). The triple response of physiology illustrates a single stimulus of pressure producing redness, swelling, and heat. Examples of pleiotropy are found in genetic diseases that affect many physiologic systems. For example, cystic fibrosis is caused by a mutation in the gene that regulates the movement of chloride and sodium ions across epithelial membranes and thus has multiple effects in that the composition of sweat, digestive fluid, and mucus is altered.22 However, if the nature of the genetic disease is unknown, researchers might postulate that a secondary effect is the primary malfunction. A single cause may induce different effects under the influence of intervening mechanisms. For example, the force generated by an orthodontic appliance can cause bone resorption of the socket wall where the vector of force is being applied but bone apposition of the socket wall opposite to where the vector of force is being applied.

Causal chain

A causal chain is an ordered series of events such that the first cause leads to the first effect, which is the cause of the second effect, and so forth (Fig 8-3). In a causal chain, the first cause (C1) leads to the first effect (E1), which in turn leads to a second effect (E2). E1 may be considered an intervening variable and E2 the outcome variable. C1 might not have any effect if the link between E1/C2 and E2 were destroyed. An inflamed tooth would activate the trigeminal nerve ganglion, which would lead to the perception of pain. If the ganglion were inactivated—for example, by local anesthetic—there would be no perception of pain.

When proposing causal chains, it must be remembered that the putative causes and links in the chain must satisfy appropriate criteria. Migliorati23 proposed a causal chain when he speculated, in a letter to the editor of a journal, that "bisphosphonates may cause oral avascular bone necrosis due to anti-angiogenic effect leading to inhibition of osteoclasts." This argument came in a letter that presented only five cases in which bone necrosis had followed bisphosphonate administration. Such a small number of cases might not ordinarily seem convincing but, because a biologic pathway to explain the observation was demonstrated, thus fulfilling Hume's criterion 2 (a connection between the putative cause and the effect), the argument that bisphosphonates were a cause became much more plausible. Causal chain arguments are often important in buttressing the plausibility of many exposure-disease causal relationships. One problem with chains is that they may branch; that is, sometimes changes in the intervening variable (E1) do not necessarily cause a change in the outcome variable (E2), as was illustrated in the clofibrate example (see chapter 2) and other instances where surrogate variables have proven unreliable predictors of outcome variables.

Conjunction/interaction

Two or more causes that necessarily occur together to produce an effect are called conjunctions or interactions (Fig 8-4). Dental caries forms in a susceptible host only if certain bacteria, dietary conditions, and host factors are present. Conjunctions can also lead to multiple effects. Interaction can change the direction and magnitude of an association between two variables. The interaction effect may be the result of a cumulative effect of multiple risk factors that are not acting independently, and interaction may produce an effect that is lesser or greater than the sum of the effects of each factor considered separately.

Intriguing examples of this phenomenon are being found in studies of gene-environment (G-E) interactions. Monoamine oxidase (MAO) is an enzyme with a variant A (MAOA) that has been associated with antisocial behavior. In a comprehensive longitudinal study in Dunedin, New Zealand,24 it was found that the variant gene did not affect everyone equally; in fact, only those who experienced abuse or maltreatment at a young age were affected. The statistical analysis employed sophisticated regression models of the following form:

f(Y) = B0 + B1(MAOA) + B2(abuse) + B3(MAOA × abuse)

where f(Y) is antisocial behavior; B0 is the intercept; and B1, B2, and B3 are the regression coefficients for presence/absence of MAOA, amount of abuse prior to age 16, and the interaction between abuse and genotype. A variety of measures were used to assess antisocial behavior outcomes, and included in the analysis were a range of confounding factors (variables influencing both the dependent variable [antisocial behavior] and the independent variables [MAOA and abuse]).
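A minimal sketch of fitting a model of this general form on simulated data (the variable names, coding, and numbers are hypothetical, not those of the Dunedin study):

```python
# Interaction model: f(Y) = B0 + B1*(gene) + B2*(abuse) + B3*(gene x abuse).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
gene = rng.integers(0, 2, n)           # 1 = low-activity MAOA-like variant
abuse = rng.integers(0, 3, n)          # 0 = none, 1 = probable, 2 = severe
# Simulate a Dunedin-style pattern: the genotype matters mainly when
# maltreatment is present, ie, a pure interaction effect.
antisocial = 0.5 * gene * abuse + rng.normal(0, 1, n)

df = pd.DataFrame({"antisocial": antisocial, "gene": gene, "abuse": abuse})
# In the formula language, "gene * abuse" expands to gene + abuse + gene:abuse,
# so the fitted parameters correspond to B0, B1, B2, and B3 above.
fit = smf.ols("antisocial ~ gene * abuse", data=df).fit()
print(fit.params)
```

On such data, the main-effect coefficients hover near zero while the interaction coefficient is recovered near 0.5, which is the signature of an effect that appears only when both factors are present.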

One can see that here we enter political tiger country, because such findings challenge established views and interests. In the instance of cannabis regulation, governments (who hope to tax the drug) and companies that seek to profit from marijuana sales might wish to downplay the risks, while parents of teenagers might advocate means to restrict supply. Another problematic issue that can hamper the rational discussion of genotype effects concerns race. If a deleterious gene is found in some ethnicities and not others, there can be social pressures to somehow fix the problem, and if the resources allocated to research in the area are deemed inadequate by the affected community, there may be charges of racism. Another view might be that all such research on gene distributions and ethnicity should be prohibited lest the findings suggest that one ethnic group is in some way inferior to another. In The Mismeasure of Man, Gould27 has documented various instances in history where science or pseudoscience acted in a manner prejudicial to some ethnicities.

[Figure: diagram showing factors that converge on the successful treatment of an osseous lesion: degree of bacterial contamination (plaque control, use of antibacterials); innate wound-healing potential (smoking, diabetes, genetics, age); site characteristics (pulp, occlusion, defect morphology, tooth anatomy); and surgical procedures (surgical approach, operator skill, instruments, root-surface preparation).] Fig 8-5 Example of a complex cause-effect relationship. Many factors influence the outcome of treatments for osseous lesions. (Adapted with permission from Kornman et al.29)

Complex etiology

Complex etiology occurs when multiple direct and indirect causes operate together and, according to Spilker,28 it applies in almost all clinical situations (Fig 8-5). The practical ability to study and effectively interpret a particular phenomenon limits consideration to the major direct cause(s) and few (if any) indirect causes. The direct investigation of cause-effect relationships is possible when one causal factor is dominant in the particular situation being investigated. The effect of an antibiotic on a microbial infection may be studied, even though the immune system may also be active against the infection, because, over a short time, the effect of the antibiotic is much greater than that of the immune system. In some studies, however, the phenomena under investigation are complex, and several causal factors may be operating. In these circumstances, a statistical model, such as that used in the MAOA example earlier, may be developed; the adequacy of the model is judged by the proportion of the variance that it explains.

Application of Criteria for Causality in Different Fields

A microbiologist's view: Koch's postulates

Of continuing interest in periodontology is the traditional method developed by Robert Koch in the late 19th century, which is used to prove that a microorganism has caused a disease. The proof is provided if the following criteria are met:

1. The microorganism must be regularly isolated from cases of the illness.
2. It must be grown in pure culture in vitro.
3. When such a pure culture is inoculated into a susceptible animal species, the typical disease must result.
4. The microorganism again must be isolated from such experimentally induced disease.

Koch's postulates fulfill Hume's criteria. Postulate 1 deals with regularity (Hume's principle 3). Postulate 3 establishes a time sequence (Hume's principle 1) and, at the same time, provides a reason for believing that there is a mechanism whereby the organism can produce the disease (Hume's principle 2). The other two postulates ensure the validity of the microbiologic techniques. Interestingly, postulate 3 has been criticized as being a tautology, for a susceptible animal is one that is identified by the presence of the disease.30

A clinical epidemiologist's view: Sackett's diagnostic tests for causation

The establishment of cause-effect relationships is an exercise in inductive reasoning and can never be certain. Different authorities, then, may require different degrees of certainty. Some members of the cigarette industry might reject medical authorities' conclusions that establish a link between smoking and lung cancer. Technically, they would be correct, because stringent proof of a link would require true experiments in humans demonstrating the effects of smoking on lung cancer. Such an experiment could not be done ethically, but, using less stringent criteria, reliable authorities have concluded that a link between cancer and smoking exists.

Causation is often discussed in relation to the etiology of disease; however, cause-effect reasoning also applies to treatments. A successful treatment, after all, is one in which the treatment caused the success. In the discussion below, the term outcome of interest refers to effects that may be desired (as in clinical treatments) or undesired (as in disease). Sackett31 has proposed and ranked the following nine diagnostic tests for establishing causation. The relative importance of each test is indicated here by the number of stars.

Four-star criterion

1. There should be evidence from true experiments in humans. In true experiments, identical groups of individuals, generated through random allocation, are and are not exposed to the putative causal factor and are observed for the outcome of interest.

Three-star criteria

2. The strength of the association must be determined. In other words, what are the odds favoring the outcome of interest with, as contrasted to without, exposure to the putative cause? The calculation of the strength of association is covered in chapter 18.
3. There must be consistency. Similar studies must have similar findings, as has occurred in the smoking–lung cancer investigations.

Two-star criteria

4. There must be a consistent sequence of events of exposure to the putative cause, followed by the outcome of interest.
5. There must be a gradient of increasing chance of the outcome of interest associated with an increase in dose or exposure to the putative cause.
6. The cause-effect association must make epidemiologic sense. There should be agreement with the current distributions of causes and outcomes.


One-star criteria

7. The cause-effect association should make biologic sense. There should be some mechanism that explains the effects.
8. The cause-effect relationship should be specific. There should be a single cause related to a single effect. (Sackett admits that this criterion is a weak one.)
9. The last and least of Sackett's tests is analogy, the similarity to another previously demonstrated causal relationship.

Sackett's criteria pass Hume's test: Criterion 1 deals with temporality; many of Sackett's other criteria relate to Hume's criterion of a history of regularity; and the diagnostic tests of biologic and epidemiologic sense satisfy Hume's criterion 2. In addition, Sackett's criterion 5 is similar to Mill's principle of concomitant variation. Others might weigh the relative importance of the various tests differently. Nevertheless, I think most would agree that these diagnostic tests can be used to increase the efficiency of a literature review, because they focus the reader's attention on the papers that will shed the strongest light on the causal question.

A pharmacologist's view: Venning's five types of convincing evidence

The criteria for establishing convincing evidence can vary between disciplines. Drug testing has received considerable attention, and its criteria have been clearly thought out. Venning32 proposed five types of convincing evidence for establishing a cause-effect relationship between a drug (cause) and an adverse reaction (effect): (1) rechallenge data, (2) dose-response data, (3) data from controlled studies, (4) experimental data on mechanisms of pathogenesis, and (5) close association in time and space. In his view, strong evidence for an adverse effect comes from (1) uniqueness of the adverse event, (2) extreme rarity of the adverse event in the absence of drug usage, and (3) improvement after withdrawal without rechallenge. Like Koch's postulates, these criteria incorporate elements specific to the discipline; for pharmacologists, dose-response curves are important. But the criteria also require more elaborate kinds of evidence than the simple time course considered by Hume. As always, the evidence becomes stronger when different techniques or approaches corroborate each other.

A logician's view

Walton33 has suggested seven types of critical questions for strengthening an argument from correlation to causation:

1. Is there a positive correlation between A and B?
2. Are there a significant number of instances of the positive correlation?
3. Is there good evidence that the causal relationship goes from A to B, and not from B to A?
4. Can it be ruled out that the correlation between A and B is accounted for by some third factor (a common cause) that causes both?
5. If there are intervening variables, can it be shown that the causal relationship between A and B is indirect (mediated through other causes)?
6. If the correlation fails to hold outside a certain range of causes, can the limits of this range be clearly indicated?
7. Can it be shown that the increase or change in B is not solely due to the way B is defined, the way entities are defined as belonging to the class of Bs, or changing standards over time of the way Bs are defined or classified?

Practical Suggestions for Assessing Causation in Scientific Papers

Read the original article. Newspaper columns on health topics need to inform and entertain their readers and must do so in limited space. To be one step ahead of your patients, you would be well advised to read the original source and apportion your trust to some extent on the basis of the quality of the journal. Refereed journals provide the reader with a guarantee not of the absolute truth of the conclusions but that at least reasonable current standards in the field have been met.

1. Having identified the study design (by reading the original research), determine whether its disadvantages, such as potential biases, apply to the study. Various study designs are discussed in later chapters.
2. Apply the criteria for causation. As noted earlier, the basic Hume-Mill criteria are that (1) the exposure preceded the appearance of disease; (2) there is a plausible mechanism that could explain the relationship; (3) there is a statistical association (such as an odds ratio or relative risk) between the exposure and the disease that is statistically significant (P value or confidence interval) and of reasonable size (see the sketch after this list); and (4) there are no equally plausible alternative hypotheses that could explain the association. This last requirement often entails assessing whether all confounders have been measured, and measured accurately.
3. Develop your own scientific standards for aspects such as the strength of the association (to be discussed in later chapters) that you think is required for an exposure to be considered a possible cause of an outcome.
4. Follow the research forward and backward. Resources such as the Science Citation Index, Scopus, or Google Scholar can indicate whether the study is finding acceptance, as evidenced by being cited positively in subsequent articles. However, it often takes considerable time for studies to be cited because of the time needed to complete research and publish. To assess research on a topic that has taken place in years past, look for systematic reviews or meta-analyses. These help you to develop a balanced view of the totality of the evidence and to use the criterion of consistency. However, such reviews might have inclusion criteria, or make subjective judgments of research quality, with which you might not necessarily agree.
5. Pay particular attention to those paragraphs in the article where the authors discuss the limitations of their study. In some instances, this discussion of limitations is made in response to the criticisms of expert referees and should be taken seriously.
6. Consider alternative explanations, such as confounding, for an observed association. Apply Walton's criteria listed in the preceding section. Finally, accept the Roman orator Cicero's advice and ask, "Cui bono?" This double-dative Latin adage asks you to consider who is going to benefit from a suggested course of action and whether the author (or speaker) is acting on the principle of self-interest rather than presenting an objective evaluation of the evidence.
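As the worked sketch referenced in item 2: the strength and statistical significance of an association can be summarized by an odds ratio and its 95% confidence interval computed from a 2×2 table (here by Woolf's logit method). All of the counts below are invented for illustration.

```python
import math

# Hypothetical 2x2 table (counts invented for illustration):
#                 disease    no disease
# exposed         a = 90     b = 910
# unexposed       c = 30     d = 970
a, b, c, d = 90, 910, 30, 970

odds_ratio = (a / b) / (c / d)                 # ~3.2
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of ln(OR), Woolf's method
lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
# An interval excluding 1 indicates a statistically significant association;
# whether an OR of ~3 is of "reasonable size" remains a scientific judgment.
```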

More details on this topic can be found in my review article on causation, association, and oral health– systemic disease connections.21

Ends and Means

Once a cause-effect relationship is established, the cause can be introduced as a means of acquiring an end that is the cause's effect. For example, the discovery that replacing sucrose with an artificial sweetener in rat diets reduces caries suggests that various products for human consumption might be modified. Sucrose reduction by replacement with artificial sweeteners might then be considered a means to a desirable end (caries reduction).

The first rule in considering whether to introduce a means to obtain an end is to consider the other consequences before adopting the means.34 In the case of sweeteners, side effects may be a concern; some link saccharin to cancer. Replacing sucrose with saccharin in some products might not be possible, because saccharin requires less bulk to produce the same amount of sweetness. There might be problems in obtaining enough saccharin or in finding new jobs for sugar workers. This list of possibilities could continue, but only after considering all the possible consequences can a wise decision be made.

The second rule of contemplating ends and means is to consider the alternative means of attaining the desired end. In the example, different artificial sweeteners, such as aspartame or sorbitol, or different methods of reducing sucrose intake might be examined.

The third rule is that, before adopting any means to a given end, one should consider the complexity of the causal pathways, which determines the reliability of the predictions based on the assumptions underlying the proposed means. One knows for sure that one can boil water by supplying heat to a kettle, but what about the effects of "workfare," which forces people to work to receive welfare? Does workfare enable recipients to move off welfare and into the workforce permanently? Or would it be more effective to employ kinder, gentler means such as education? Social programs operate in an environment of high causal density, so it is quite easy not to know about all the variables involved in determining the response to an intervention, such as the introduction of a program. Moreover, there is often intercorrelation among the variables, such as between education and income, as well as interaction effects (and with many variables there are multitudinous possible interactions). These factors tend to weaken the ability of regression models to make reliable, useful predictions. Finally, there is also the problem that Manzi13 describes as the "centrality of many human minds and wills in the causal pathway," which in practical terms means that humans may make choices that are to their own personal benefit but do not help achieve the aims of a program.

A simple example is a salesman whom I once knew whose employer paid for the gas he purchased as he made the rounds of customers and potential customers. Strangely, even though his car did not require premium gas, he always purchased premium and billed it to his employer. The reason was that the service station where he filled up had a points program, so the more he spent on gas, the more points he received. He could convert these points to free coffee and doughnuts. As the conversion rate was about 1.5 cents per dollar spent, this was an extremely expensive way to buy coffee, but he got the coffee, and his employer got the bill.

Overall, Manzi concluded that "unlike physics or biology, the social sciences have not demonstrated the capacity to produce a substantial body of useful, nonobvious, and reliable predictive rules about what they study—that is, human social behavior, including the impact of proposed government programs." In his mind, the way forward would be greater utilization of randomized field trials involving many fast, cheap tests in rapid succession, an approach that has proved effective in providing business solutions in high-causal-density environments.

A commonsense view of explanation

Insofar as hypotheses represent provisional explanations of observations, it is worth examining explanation as a concept. Beardsley34 has outlined the process of explanation. In brief, there are three essential ingredients:

1. Something to be explained, termed the explainee; eg, pain in a tooth.
2. Something that does the explaining, termed the explainer; eg, presence of deep decay and pulpal irritation.
3. A conditional bridge that links the explainee and the explainer; eg, if there is deep decay and pulpal irritation, the tooth will be painful.

Selecting the right explanation depends on systematically ruling out the various alternatives. In our example, there are explainers besides decay, such as gingival recession, which could produce sensitivity in the tooth in appropriate circumstances. Some explainers are simply not possible, and some could be ruled out by direct examination. If explainers contradict the facts, they can be rejected (eg, if no decay were found on the tooth). Other explanations might fail because the conditional bridge is weak; explaining the pain in the tooth by means of the phase of the moon would stretch credulity. When the obviously faulty explanations have been ruled out, the problem becomes that of choosing between the possible explainers. In a commonsense approach, Beardsley34 recommends asking three questions:

1. How common is the explainer? Some explainers could be rejected because they are so rare. Pain in a tooth may be referred pain from other sources, but this happens so rarely that one is unlikely to encounter it. Dental decay, on the other hand, is common. The most common explainer may not locate the truth, but it is a good place to start looking.
2. How simple is the explanation? Perhaps the tooth pain could be explained by the possibility of an electrochemical interaction between the amalgam fillings and other restorations in the mouth. Such an explanation might require specific concentrations of materials in the restorations, particular properties of the saliva, and a particular kind of diet. The logical principle to follow is to reject the more complicated explanation if a simpler one can do the job.
3. Does the explanation check out? If the explanation is correct, it has consequences that can be tested. In the example, gingival recession can lead to pain in a tooth, but such pain is associated with exposed dentin and has different characteristics from the pain associated with decay.

References

1. Koslowski B. Theory and Evidence: The Development of Scientific Reasoning. Cambridge: MIT, 1996:264.
2. Trankall A. Reliability of Evidence. Stockholm: Beckmans, 1972.
3. Purple blues. The Economist 30 September 2017:75–76.
4. Bronowski J. The Common Sense of Science. New York: Vintage Books, 1978.
5. Coady CAJ. Common sense. In: Honderich T (ed). The Oxford Companion to Philosophy. New York: Oxford University Press, 1995:142.
6. Wikipedia. I know it when I see it. https://en.wikipedia.org/wiki/I_know_it_when_I_see_it. Accessed 1 August 2018.
7. Mason SF. A History of the Sciences. New York: Collier Books, 1979:255.
8. Hope V. Thomas Reid. In: Honderich T (ed). The Oxford Companion to Philosophy. New York: Oxford University Press, 1995:774–775.
9. Hume D. A Treatise of Human Nature, ed 2. Oxford: Clarendon Press, 1978:173–175.
10. Hume D. An Enquiry Concerning Human Understanding (1772). http://www.marxists.org/reference/subject/philosophy/works/en/hume.htm. Accessed 1 August 2018.
11. Stebbing LS. A Modern Introduction to Logic. London: Methuen, 1942:339.
12. Manzi J. What social science does—and doesn't—know. City Journal, Summer 2010.
13. Manzi J. Uncontrolled. New York: Basic Books, 2012:142.
14. Mackie JL. The Cement of the Universe: A Study of Causation. Oxford: Clarendon, 1980.
15. Boksman L, Santos GC. Principles of light-curing. Inside Dent 2012;8(3).
16. Price RB. Avoiding pitfalls when using a light-curing unit. Compend Contin Educ Dent 2013;34:304–305.
17. Stone GK. Evidence in Science: A Simple Account of the Principles of Science for Students of Medicine and Biology. Bristol: John Wright & Sons, 1966:83–86.
18. Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical Epidemiology: A Basic Science for Clinical Medicine, ed 2. Boston: Little Brown, 1991.
19. Smith GD, Phillips AN. Confounding in epidemiological studies: Why "independent" effects may not be all they seem. BMJ 1992;305:757–759.
20. Krieger N. Epidemiology and the web of causation: Has anyone seen the spider? Soc Sci Med 1994;39:887–903.
21. Brunette DM. Causation, association, and oral health–systemic disease connections. In: Glick M (ed). The Oral-Systemic Health Connection: A Guide to Patient Care. Chicago: Quintessence, 2014:13–48.
22. Kolata G. A new approach to cystic fibrosis. Science 1985;228:167.
23. Migliorati CA. Bisphosphanates and oral cavity avascular bone necrosis. J Clin Oncol 2003;21:4253–4254.
24. Fergusson DM, Boden JM, Horwood LJ, Miller AL, Kennedy MA. MAOA, abuse exposure and antisocial behaviour: 30-year longitudinal study. Br J Psychiatry 2011;198:457–463.
25. Cosker E, Schwitzer T, Ramoz N, et al. The effect of interactions between genetics and cannabis use on neurocognition: A review. Prog Neuropsychopharmacol Biol Psychiatry 2018;82:95–106.
26. Fields RD. Link between adolescent pot smoking and psychosis strengthens. https://www.scientificamerican.com/article/link-between-adolescent-pot-smoking-and-psychosis-strengthens/. Accessed 1 August 2018.
27. Gould SJ. The Mismeasure of Man. New York: Norton, 1981.
28. Spilker B. Guide to the Interpretation of Clinical Data. New York: Raven Press, 1986:19–26.
29. Kornman KS, Robertson PB. Fundamental principles affecting the outcomes of therapy for osseous lesions. Periodontol 2000 2000;22:22–43.
30. Skrabanek P, McCormick J. Follies and Fallacies in Medicine. Buffalo: Prometheus, 1990:30–31.
31. Sackett DL. Evaluation: Requirements of clinical application. In: Warren KS (ed). Coping with the Biomedical Literature. New York: Praeger, 1981:123–140.
32. Venning GR. Identification of adverse reactions of new drugs. III. Alerting processes and early warning systems. Br Med J (Clin Res Ed) 1983;286:458.
33. Walton D. Informal Logic: A Pragmatic Approach, ed 2. Cambridge: Cambridge University Press, 2008:277–278.
34. Beardsley MC. Writing with Reason: Logic for Composition. Englewood Cliffs, NJ: Prentice Hall, 1976:122–151.


9 Quacks, Cranks, and Abuses of Logic



"We do not think it necessary to prove that a quack medicine is poison; let the vendor prove it to be sanative."
THOMAS BABINGTON MACAULAY1

Three Approaches to Medical Treatment

Three approaches contribute to our understanding of medical treatment: anecdotal evidence, the numerical method, and the pathophysiologic approach. To some extent, these approaches have been in conflict.2 The first two approaches are empirically driven, while the third is theory driven.

Personal experience related by anecdote is reliable when effects are clear, large, and occur quickly. For example, experienced dentists often teach as part-time instructors in dental school clinics because, among other reasons, they can provide practical know-how that is not easily delivered in lectures. Students can try the techniques recommended by these clinicians and can readily determine whether they work, at least in the short run. However, anecdotal evidence does not work well for many chronic conditions, where the effects are not clear-cut or are expected to occur over the long run.

Another empirical method, the numerical approach, developed by Pierre Charles Alexandre Louis in the 1830s, relied on careful observation and the collection of statistics. Louis found, for example, that patients who were bled early in their treatment for typhoid fever fared worse than those who were not—a finding counter to the theories of the day. Opponents of the numerical approach objected to the lack of consideration given to the unique nature of disease in individual patients. The positivist philosopher Auguste Comte suggested that reliance on a "theory of chances" would reduce practitioners to a servile status, whereby they would have to accept ideas imposed upon them by professors who had collected large numbers of observations.3

Evidence-based medicine is the heir to the Louis tradition, and, to some extent, it has attracted the same kind of criticisms that Louis endured.

For example, Louis' opponents worried about ignoring the uniqueness of each patient and questioned whether data collected on the Paris poor could be applied to private practices treating the affluent. Moreover, at that time, Louis lacked the statistical expertise that informs present-day evidence-based medicine. Nevertheless, concerns persist about the validity of a doctrinaire, rules-based approach founded on assessment by panels of experts far removed from the actual patient-physician encounter, which typically involves thoughtful analysis of the underlying mechanisms of disease.4

Developed by Claude Bernard, the pathophysiologic approach is experiment-driven, deterministic, and reductionist. In this approach, the control of disease depends on an understanding of physiologic and pathologic mechanisms. Today, the pathophysiologic approach dominates the search for drugs and other treatments and underpins the vast biomedical research enterprise. A mechanism-based approach to developing a new analgesic drug comprises eight steps,5 the first four of which are true to the mechanistic/reductionist spirit:

1. Establishment of a biologic hypothesis of the pathophysiology of pain (such as prostaglandin E2 release from inflamed tissue).
2. Identification of a potential molecular target (such as inhibition of the enzyme COX-2).
3. Establishment of a screen to look for small molecules that bind to the target.
4. Chemical optimization of any promising molecule found from the screen.
5. Test in animals.

Claude Bernard would not have hesitated to undertake the fifth step as well. But at this point, the possibilities of complex interactions become more probable. The drug may have unforeseen effects if it interacts with molecules other than the target. The complexity becomes even more worrisome when the drug is applied to human populations in the final three steps. The demonstrated need for animal and subsequent human testing reflects the lack of predictability of treatments based solely on physiologic mechanisms. Modern cell biology has revealed a massive complexity in cell signaling and processing of stimuli, and Swales2 goes so far as to suggest that this complexity makes claims to predictive value impossible. Thus, in the last analysis, human testing and the principles of evidence-based medicine will eventually be applied.

6. Test for pharmacokinetic properties and safety in humans. The US Food and Drug Administration (FDA) nomenclature for this step is Phase I. Such studies are closely monitored and are usually (but not always) conducted in healthy volunteer subjects. The objective is to determine the metabolic and pharmacologic actions of the drug in humans and the side effects associated with increasing doses and, possibly, to gain early evidence on effectiveness.
7. Test for efficacy in small numbers of patients (around 100 to 300). The FDA nomenclature for this step is Phase II, and, commonly, new drugs fail at this stage because of poor efficacy or toxic effects.
8. Test in large-scale (probably greater than 1,000 patients) multicenter clinical trials. Typically, these Phase III studies are double-blind, randomized controlled trials.

Sometimes complications are seen after the drug is released onto the market; therefore, post-launch safety surveillance is conducted as Phase IV. This was the case when drug maker Merck recalled Vioxx (a COX-2 inhibitor used in treating arthritis) because it was linked to heart problems.

Two questions about any treatment

There are two basic questions that must be asked about any treatment:

1. Does the treatment work? This is the primary question asked in evidence-based medicine or dentistry. Other chapters discuss concepts, such as asking appropriate questions, searching for the evidence, evaluating the standard of evidence within a hierarchy, research design, and meta-analysis, that will help the reader uncover this information and assess it.
2. How does a treatment work? This question relates to the pathophysiologic model of Claude Bernard and his successors, such as physiologists, biochemists, and molecular biologists.

Established medicine and dentistry form a coherent approach to understanding and treating health problems.

To return to our pain example: There is evidence that inflamed tissues release prostaglandins; the chemical structures of the prostaglandins are known; and many enzymes that participate in their metabolism have been identified. Moreover, the effectiveness of inhibitors of these enzymes in reducing pain is known. In many instances, histologic studies have characterized the possible negative effects of high concentrations of these inhibitors on tissues. So, when we take an aspirin or ibuprofen to interfere with prostaglandin metabolism, we can be confident of relief of at least some types of pain. Failures—and discomfort—still occur, but, as the processes involved are further understood, research can lead to better inhibitors and reduced side effects.

[Figure: panels pairing levels of empirical evidence (strong, some, weak, none, bad) with levels of underlying theory (strong, weak, none, bad), illustrated by the EBM-pathophysiologic partnership (the ideal), "doctrinaire" EBM, home remedies, drugs or treatments in Phase I to III FDA trials, chiropractic, and quackery.] Fig 9-1 Bicycle metaphor for common theory-evidence relationships. A treatment is most reliable when both the empirical evidence for its effectiveness and the underlying theory or mechanism explaining why it works are strong. EBM, evidence-based medicine.

Relationships between theory and empirical evidence

Figure 9-1 illustrates some common theory-evidence relationships using a bicycle metaphor.

The ideal occurs when there is strong empirical evidence for a treatment's effectiveness and a strong underlying theory or mechanism explaining why the treatment works. An example would be treating minor headaches with aspirin. Both empirical evidence and plausible mechanisms of action are present, and we can be confident that the treatment is safe and reliable, just like a regular bicycle.

However, it can happen that only one component is strong and the other is weak, leading to a relationship that can be compared with a penny-farthing bicycle. The penny-farthing bicycle had a tendency to crash when it hit unpredictable situations, such as potholes, and the rider "came a cropper," a phrase originally used in reference to equestrians falling headfirst but, in this instance, over the handlebars. At times, there is strong empirical evidence that a treatment works, but the reason the treatment works is obscure. An example of this situation is afforded by Emdogain (Straumann). There is evidence for the effectiveness of this preparation at the highest level of the hierarchy of evidence6 (ie, systematic review), but some have questioned whether the result is attained primarily because of the physical nature (gel) of the preparation or because of the presence of the supposed active ingredient. Until this question is resolved, a clinician might be unsure about applying the preparation in situations where the barrier effect provided by the gel might be compromised. Simply put, if either theory or evidence is weak, we cannot predict as reliably as would be desired.

In some cases, there may be either no credible theory or no empirical evidence (exemplified by riding a notoriously unstable unicycle); falls are inevitable in such situations. Almost daily, there are news reports of breakthroughs in the treatment of disease based on experiments in rodents or in vitro studies. The understanding of mechanism in such instances may be very strong, but the predictability of successful application in the absence of evidence from human studies is very low. An FDA official estimated that a drug starting human trials in the year 2000 was no more likely to reach the market than one entering trials in 1985 (roughly an 8% chance), and the product failure rate for drugs in Phase III trials has increased to nearly 50%.7

The condition can also occur when there is some empirical evidence to support the use of a treatment that, unfortunately, is based on a faulty theory; chiropractic comes to mind as an example. In a review of chiropractic published in a book generally favorable to alternative medicine, a practitioner, Redwood,8 states, "however, positive health changes have never been convincingly correlated with vertebral alignment. . . . [A]lternative hypotheses are necessary to replace the outmoded bone-out-of-place concept." Thus, the theory underlying chiropractic is viewed as outmoded and incorrect. Chiropractic treatment cannot be expected to be progressively improved, because the current theory cannot fruitfully guide its future development.

Finally, sometimes evidence for the effectiveness of a proposed treatment is either not available or of poor quality, and the theory underlying the treatment is obviously erroneous. This situation is the natural domain of quackery and scientific cranks. Homeopathy, as it is typically applied today, involves dilutions of such magnitude that not a single molecule of the initial solution remains. The theory that the original solution somehow leaves its trace in the water of subsequent dilutions—the ad hoc theory used to explain how homeopathic preparations work—is generally regarded as implausible. Believers in homeopathy face an awkward dilemma: either the established laws of chemistry and physiology are wrong, or the purported mechanisms of homeopathy are wrong; the two cannot coexist.9 Thus, homeopathy is a quintessential example of a bad theory, although it appeared to have some validity from an evidence-based perspective when a 1997 meta-analysis of homeopathy generated the surprising finding of an overall benefit.10 The study received abundant criticism, however, as mechanistically inclined investigators were incredulous and examined it more closely. Subsequent analysis of homeopathic effects indicated that, as a general rule, studies showing a benefit were of lower quality and that, when study quality was more rigidly controlled, the findings were compatible with the notion that the clinical effects of homeopathy are placebo effects11—a finding that prompted a Lancet editorial entitled "The end of homeopathy."12 The Lancet editors were probably guilty of wishful thinking; commercial forces with an interest in promoting homeopathy doubtless will keep the question active for some time.

In summary, the credibility of any theory or study depends on its consistency with diverse sources, particularly sources supported by a large body of evidence. Perhaps building on Kuhn's13 description of science as puzzle solving, the philosopher Susan Haack14 likened interpretation of data to solving a crossword puzzle; the clues are analogous to the observational or experimental evidence, and the entries are the analogue of currently accepted background information. The two sources have to blend with and reinforce each other.

Finally, it should be noted that bad theories have plagued medical doctors over the centuries from the time of Hippocrates in the fifth century BCE; Wootton15 argues that until the invention of antibiotics in the 1940s, doctors in general did their patients more harm than good. But unlike some of the quack approaches to solving patient problems, medicine is now structured to advance through research, and medical/dental research has yielded vast benefits to the populace in the developed world.

Scientific Cranks

Some investigators work in such isolation that they are unaware of what is happening—or has happened—elsewhere. The work of these investigators can be seriously flawed, for inductive logic requires that a researcher consider all available evidence. Often, scientists working in isolation publish in obscure journals or write books that are published by nonacademic presses. A fascinating subset of the isolated scientist is the crank, who may publish internally consistent work without reference to what others have found. Gardner16 has written entertaining books in which he documents the classical characteristics of scientific cranks. More recent books17 on cranks and quacks that employ critical concepts from evidence-based medicine include Goldacre's Bad Science18 and Bausell's Snake Oil Science.19 In any case, Gardner's criteria for identifying cranks are given below. Cranks:

• Consider themselves geniuses but regard colleagues as ignorant blockheads
• Consider themselves unjustly persecuted
• Attack the greatest scientists and best-established theories
• Tend to write in complex jargon, often of their own invention
• Contribute to journals that they edit
• Publish books privately or with nonacademic publishers

Cranks most likely to disturb dentists are some of the antifluoridationists. The argument for fluoridation is convincing not only to dentists and to dental researchers but also to eminent scientists of other disciplines, who are in a position to consider the issue in its widest context. Sir Peter Medawar, then director of the National Institute of Medical Research (UK), said of the fluoridation issue, "Every time an American municipality determined against fluoridation there was a little clamor of rejoicing in the corner of Mount Olympus presided over by Gaptooth, the God of Dental Decay." Medawar20 states that:

The more difficult part of the fluoridation enterprise is not scientific in nature. I mean that of convincing disaffected minorities that the purpose of the proposal is not to poison the populace in the interests of a foreign power or to promote the interests of a local chemical manufacturing company.

Although fluoridation is a complex issue, there certainly has been a tendency for both sides to argue simplistically. Thus, in considering arguments for or against fluoridation, remember that their most likely flaw is a failure to consider all the evidence. Lifesavers Guide to Fluoridation by Yiamouyiannis,21 an active opponent of fluoridation, cites 250 references in support of its claims. When the citations were examined critically by experts, many were found to be irrelevant to community water fluoridation, while others represented unreplicated or refuted research. Some references that supported fluoridation were selectively quoted and misrepresented. Moreover, only 48% of the English-language articles cited came from refereed journals, and a large percentage of the articles were published in outdated or obscure journals.22

Cranks do not lack perseverance. The editor of the British Dental Journal, Grace,23 wrote several editorials on the benefits of fluoridation; as they contained judgments of arguments made by specific people, these were checked very carefully by lawyers for the British Dental Journal and thus were very carefully worded. Nevertheless, the British Dental Journal was sued by Yiamouyiannis in proceedings that ended only when the plaintiff died.

Crank science can have tragic effects. Adele Davis is generally regarded by nutrition scientists as a major source of nutrition misinformation, and some reports suggest that newborn infants died as a result of their parents following Davis' principles of nutrition.24–26 However, honest mistakes also occur. The extant literature is so large and scattered that it is not difficult for an author to be unaware of an article's existence and fail to cite it. Nevertheless, the goal of a good investigator is to consider all the available evidence.

Quacks

The term quack has come to mean any dishonest, ignorant, or incompetent practitioner, regardless of formal training.

Dentists who practice "holistic" dentistry that includes unproven or refuted methods are considered quacks. The term quackery is derived from quacksalvers, wandering peddlers who sold mercury ointments as a treatment for syphilis during the Renaissance era. The term was defined for a report of the US House Select Committee on Aging's Subcommittee on Health and Long-Term Care by Congressman Claude Pepper as "anyone who promotes medical schemes or remedies known to be false or which are unproven, for a profit."27

Pepper's definition of quackery poses a problem: If stringent criteria for proof were adopted, some accepted medical or dental treatments might well be found lacking, even though they are considered the best current treatments. In the view of Skrabanek and McCormick,28 quackery can be distinguished from rational therapy in that it does not derive from any coherent or established body of evidence, and it is not subjected to rigorous assessment to establish its value.

Bad science in complementary and alternative medicine

Complementary and alternative medicine (CAM), as defined by the National Center for Complementary and Alternative Medicine (NCCAM), since renamed the National Center for Complementary and Integrative Health (NCCIH) of the US National Institutes of Health (NIH), is a group of diverse medical and health care systems, practices, and products that are not considered part of conventional medicine.29 Complementary medicine is used together with conventional medicine, such as when aromatherapy is used to lessen discomfort after oral surgery. Alternative medicine is used in place of conventional medicine.

Practitioners of unconventional medicine constitute an economically important group, for the cost of unconventional medicine was estimated to be at least $13.7 billion in 1990. About one in three respondents to a large survey reported using unconventional medicine, mainly for chronic conditions.30 More recent data indicate increasing use of CAM; a survey funded by NCCAM found that 75% of respondents had turned to CAM at some point in their lifetimes, including 62% who had used it in the previous year.31

Relatively recently, the NIH established NCCAM as a center dedicated to exploring complementary and alternative healing practices in the context of rigorous science, training CAM researchers, and disseminating authoritative information to the public and to professionals.29

NCCAM classifies alternative medicine into five categories: (1) alternative medical systems, such as homeopathic medicine and naturopathic medicine; (2) mind-body interventions, such as therapies that use creative outlets such as art, music, or dance; (3) biologically based therapies, such as dietary supplements; (4) manipulative and body-based methods, such as chiropractic; and (5) energy therapies, such as therapeutic touch. The evidence for the effectiveness and, indeed, plausibility of these methods varies widely.

Clearly, NCCAM is attempting to approach some of these therapies on a rational, evidence-based basis by funding studies that test their effects. In the document providing justification for its 2006 fiscal year activities,32 NCCAM highlighted some results of NCCAM-funded studies:

• "Story of Discovery: Ancient Acupuncture Provides Modern Pain Relief for Arthritis" (a report on a Phase III trial). However, the article notes that all patients also received standard anti-inflammatory medications.
• "Science Advance: Popular Echinacea Product Not Effective in Treating Pediatric URIs."
• NCCAM also supports the largest randomized Phase III clinical trial to date of the potential of ginkgo biloba to prevent dementia in the elderly.

External evaluators of NCCAM have not been impressed. For example, Marcus and Grollman33 published an article in Science that concluded, "NCCAM funds proposals of dubious merit; its research agenda is shaped more by politics than by science; and it is structured by its charter in a manner that precludes an independent review of its performance." Offit34 gives numerous examples where center funding has not yielded any valuable scientific insights, such as "$406,000 to determine that coffee enemas don't cure pancreatic cancer." In revising this chapter, I wrote to NCCIH and received the following reply:

Thank you for your e-mail to the National Center for Complementary and Integrative Health (NCCIH) asking for examples of research funded by NCCIH that have been rigorous and had successful outcomes and that have also been published in peer-reviewed journals. You may want to take a look at our Web page "Research Results" (at nccih.nih.gov/research/results). On the left side of this page, there are "Recent Study Results—Summaries of Selected Studies." This includes recent articles on NCCIH-funded research. And on the right, there's a link for "NCCIH-funded studies in PubMed." Or you may wish to conduct your own search in PubMed (at http://pubmed.gov). To find out more about research we are funding, with accompanying research results and publications, you may want to search the NIH RePORTER database at projectreporter.nih.gov/reporter.cfm. You can select Nat'l Center for Complementary and Integrative Health (NCCIH) in the "Agency/Institute/Center" field and click the checkbox next to "Funding." The NIH RePORTER, an online database of federally funded research projects, includes descriptions of projects and published research results.

Inspection of the web pages indicates that some of the research would be considered mainstream, such as exploring pain mechanisms, and is reported in appropriate journals (such as Neuron). There appear to be more negative results (ie, no or small effects) than occur in research published in traditional biomedical disciplines. For example, the page on ginkgo states, "There's no conclusive evidence that ginkgo is helpful for any health condition" (despite the large study sponsored by the NCCIH), which is a sign that the NCCIH can be objective. My overall assessment is that, despite the years that have passed since NCCAM/NCCIH's entry into the arena in 1999 and the availability of considerable funds to support numerous research projects, there appears to be scant therapeutic return on the NCCAM/NCCIH investment, but at least some attention is being paid to discovering whether there is compelling evidence to support therapies that have considerable public acceptance.

A continuing problem with many alternative medical approaches remains their development outside the normal peer-reviewed structure of science, where publications are evaluated by skeptical experts and where experiments are repeated and, through repetition and refinement, come to be accepted. An acceptable scientific hypothesis, according to the philosopher Hempel,35 is internally consistent, comprehensive, testable, simple, novel, and predictive. Some of these criteria, such as simplicity and novelty, are easily met by alternative therapies. For example, the bone-out-of-place chiropractic theory, while anatomically erroneous,36 is simple and was novel in its time. However, other criteria often pose formidable problems for alternative medical approaches, leaving them open to quackery.

Comprehensiveness

An essential criterion for the acceptability of inductive arguments is the consideration of all available evidence before reaching a conclusion. However, quacks and some alternative practitioners seem to live in worlds of their own creation, using concepts unique to their approach. Acupuncture, for example, posits the existence of meridians along which flows Qi (pronounced "chi"), but the meridians do not correspond to any known physiologic or anatomic pathways. Researchers are investigating neurochemical mechanisms that might integrate acupuncture into medical science, but questions about the meridians remain. Skrabanek37 concluded that acupuncture is effective in some patients with functional and psychosomatic disorders, but that the effects of acupuncture are unpredictable, unreliable, and possibly related to hypnosis and suggestion.

Testability

Some theories of alternative medicine cannot feasibly be tested, and, indeed, their proponents seem to accept having them remain untested. Two modern apologists for homeopathy note, "This study illustrates the difficulty in doing research in homeopathy. Homeopathic medicines are individualized by definition based on a totality of symptoms. Most conventional clinical research involves administering the same medicine to all patients."38

Open publication

A norm of science is that scientific knowledge is public knowledge.39 Materials and methods, data, and interpretations are published in accessible journals, so organized skepticism and acceptance by consensus occur. Quack science tends to keep secrets. For example, a widely publicized report from India—involving a large number of people undergoing a treatment protocol that includes swallowing live sardines—notes that the exact treatment is a family secret known only by five brothers.40 As noted previously, some publications supported by NCCIH are being reported in mainstream scientific journals.

Lack of objectivity

Traditionally, objectivity has been one of the hallmarks of good science.

In theory, this means that any qualified scientist could repeat the observations, obtain similar data (within the limits of experimental error), and reach the same conclusions as those published by another scientist. Conclusions are generally conservative, in that they do not extend much beyond the data; extravagant speculations are discouraged in scientific publications.

In quack science, claims for treatments are anything but conservative. An advertisement for an instrument known as the Zapper claims that it is "the cure for all diseases." Some psychics claim that results bordering on the miraculous can only be obtained by themselves. Thus, their results can never be repeated, and, even if they did occur, the psychic approach could not be considered scientific. In the absence of scientifically repeatable data, quackery often relies on testimonials based on personal experience.

Why and when personal experience can be unreliable

Quackery relies on personal testimony and anecdotal evidence in place of rigorous scientific assessment. Personal testimonies are not used in scientific medicine to prove or disprove a treatment's effectiveness, because the observations are not made objectively under controlled conditions, making rigorous assessment of the treatment impossible. Yet personal testimonies of being cured are a powerful means of influencing opinion because of the common belief in personal experience as the best way of determining whether something works.

Personal experience is reliable when the treatment under consideration has a large effect, occurs quickly, and has a clear outcome. Anyone who has hit a thumb with a hammer while driving nails knows from personal experience not to do it again. But the reliability of personal experience declines markedly in instances where the symptoms are variable, the time course is long, the effects of the treatment are complex, and the outcome measures are ill-defined. Oil from croton seeds exemplifies a complex treatment; the oil has cathartic properties but also contains potent tumor-promoting phorbol esters. An individual taking croton oil might deem the preparation effective and swear to its efficacy, not realizing what the long-term effects might be.

Why ineffective treatments may appear to be successful

Quacks thrive on treating conditions with variable and ill-defined symptoms, for several factors can make treatment of these conditions appear successful.

Placebo effect

Evidence suggests that the expectations of practitioner and patient can markedly affect the outcome of treatment, and this effect is optimum when both practitioner and patient believe strongly that a treatment is efficacious.41 One study42 examined the role of these nonspecific effects by analyzing the results of uncontrolled studies of medical and surgical treatments in which both patients and therapists had expectations of a successful outcome. The particular treatments selected for study were later found to be ineffective when examined in controlled trials. Thus, any effects reported in the uncontrolled trials did not occur as a result of the treatment but were nonspecific and included factors such as expectancy and belief. These nonspecific effects accounted for good-to-excellent improvement in almost 70% of patients treated.

More recent studies have provided substantial evidence that expectation of analgesia is an important factor in the engagement of objective, neurochemical antinociceptive responses to a placebo. The responses appear to be mediated by endogenous opioid neurotransmission in the dorsolateral prefrontal cortex.43 In other words, belief that a treatment will relieve pain causes the brain to respond by releasing its own natural painkillers. Often, quacks' and patients' confidence in a treatment produces the placebo effect and makes the treatment look effective.

The placebo effect also benefits practitioners of conventional medicine and dentistry. Beck44 has reviewed the topic for dentistry and advises dentists on how to use the placebo effect. However, it should be noted that different placebos have different effects. In a comparison of two placebos (a sham acupuncture device and an inert pill),45 the authors concluded that placebo effects are malleable and depend on behaviors embedded in medical46 and dental47,48 rituals. For example, the effectiveness of a sugar pill given before a dental extraction was enhanced; that is, there was less fear, anxiety, and pain when the pill was administered with practitioner testimonials of its effectiveness47 than when it was presented as marginally effective. The placebo can even exert an effect through the intermediary of the beliefs of the health care provider.

In an elaborate experiment involving the effects of a placebo, a pain reliever, and a pain enhancer on pain experienced during wisdom tooth extraction, patients experienced less pain when the providers thought they were delivering a painkiller, or at least knew it was possible that they were.48

Bausell19 has considered the voluminous literature on the placebo and delineated three key principles:

1. The placebo is real and is capable of exerting at least temporary pain reduction. The effect depends on the belief that the intervention or treatment is capable of producing pain reduction. The belief can be instilled through classical conditioning or by the suggestion of a respected individual that the intervention can reduce pain.
2. The placebo has a plausible mechanism of action, the endogenous opioid system, for pain reduction.
3. There is no compelling, credible scientific evidence to suggest that any complementary or alternative medical therapy benefits any medical condition or reduces any medical symptom better than a placebo.

Variability

Some conditions have high rates of spontaneous remission or long asymptomatic periods (such as herpes simplex virus infection) or are influenced by psychosocial variables. Other conditions are self-limiting, as the body's natural defenses and reparative mechanisms come into play. If a quack treatment is given before natural recovery, a quack or patient might conclude that the treatment is effective. This exemplifies the post-hoc fallacy, which occurs when an investigator relies on just one (temporality) of Hume's three criteria for causation.49

Regression
Typically, patients seek treatment when their symptoms are bothersome. In a condition with variable symptoms, the intensity of the symptoms likely will regress from the bothersome extreme that sent the patient to a practitioner to the mean value, thereby making the practitioner’s treatment look effective. Thus, a person with low back pain will go to a chiropractor when the pain becomes intolerable and will consider the treatment effective when the problem returns to its normal levels—as it probably would have done without treatment.

Reporting bias
People are more likely to report a positive outcome than a negative one, if only for fear of appearing foolish. Who wants to say, “I still feel lousy after spending megabucks on an herbal supplement”? Reporting bias aids quacks, because believers report a quack’s successes, whereas the wiser (but sadder and poorer) individuals who recognize the treatment’s failure keep their knowledge to themselves. Reporting bias can also be problematic for conventional treatments. Guided tissue regeneration was once widely recognized as an effective means of achieving regeneration (as opposed to mere repair) of the tooth-supporting structures. However, a systematic review found that about 11 patients needed to be treated to produce one extra site gaining 2 mm or more of attachment50 (see the note on this metric below).
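For readers unfamiliar with the metric, the number needed to treat (NNT) is, by its standard definition (an assumption here, not stated in the review itself), the reciprocal of the absolute risk reduction (ARR), the difference in success rates between treated and control groups. A minimal worked form:

```latex
% NNT = 1 / ARR (standard definition; the ARR value below is back-calculated
% from the reported NNT of about 11, not taken from the review).
\[
\mathrm{NNT} = \frac{1}{\mathrm{ARR}}, \qquad
\mathrm{NNT} \approx 11 \;\Rightarrow\; \mathrm{ARR} \approx \tfrac{1}{11} \approx 0.09
\]
% ie, roughly 9 extra sites gaining 2 mm or more of attachment
% per 100 patients treated.
```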

Complex treatment
People who consult quacks often also elect to undertake conventional therapy. Indeed, Eisenberg et al30 found that, among those who used unconventional treatment for a serious condition, 83% also sought treatment from a medical doctor. Thus, a successful outcome may be the result of the conventional treatment, rather than the quack treatment. A more general problem is that treatments can be complex and can be thought of not only as the administration of a specific agent but also as an associated ritual. As noted earlier, placebo effects seem to be influenced by particular rituals.

Quack strategy and tactics
A tactic used by quacks to promote their services is the principle of providing incomplete information. Those effects that warp the judgment of individuals on the effectiveness of treatments occur so commonly and predictably that a strategy for being a successful quack can be codified (see list below). In general, quacks flourish when conventional medicine can offer no cure.

Freireich’s rules for becoming a successful quack51
• Choose a disease that has a natural history of variability. (For those wishing to become dental quacks, temporomandibular joint disorder is a good candidate.)


• Wait until a period when the patient’s condition is getting progressively worse.
• Apply the treatment.
• If the patient’s condition improves or stabilizes, take credit for the improvement, then stop the treatment or decrease the dosage.
• If the patient’s condition worsens, report that the dosage must be increased, that the treatment was stopped too soon and must be restarted, or that the patient did not receive treatment long enough.
• If the patient dies, report that the treatment was applied too late—a good quack takes credit for any improvement but blames deterioration in the patient’s condition on something else.

Quacks have been quick to promote cures for AIDS, and a wide range of unproven treatments have been offered, only serving to defraud desperate patients.27 Lacking scientific evidence to support their claims, quacks have developed effective persuasive techniques, some of which are listed below. Some of the tactics employ half-truths. For example, quack tactic 6 (below) advocates claiming that a treatment is recognized in other parts of the world, but not in North America. Indeed, some successful treatment modalities, such as the Brånemark implant, were developed outside North America. In Sweden, Brånemark obtained very good evidence that the implant system was successful, and some 15 years passed before it was introduced into North America. However, other remedies developed outside North America are fraudulent; it would be highly irrational to go to the Philippines for so-called psychic surgery. The main issue is not where a treatment was developed, but the evidence supporting its use.

A partial list of quack tactics52
• Promise quick, dramatic, painless, or drugless cures.
• Use anecdotes, case histories, or testimonials to support claims.
• Use pseudoscientific disclaimers, eg, instead of promising to cure a specific illness, claim that the treatment will “detoxify the body,” “strengthen the immune system,” or “bring the body into harmony with nature.”
• Use dubious credentials, eg, unaccredited schools.
• Claim product or service provides treatment or cure for multiple illnesses or conditions.
• Claim secret cure has been discovered or recognized in another part of the world but is not yet accepted in North America.

• Claim to be persecuted by orthodox medicine, eg, claim that doctors are suppressing the cure so they will not have competition.
• Claim medical doctors do more harm than good.
• Report that most disease is caused by a faulty diet and can be treated by nutritional changes; advocate the need for vitamins and “health foods” for most people.
• Use scare tactics to encourage use of product.
• Support “freedom of choice,” both of individuals to market unproven and unorthodox methods and of consumers to purchase them.

Evolving quack strategies

New avenues
Quacks have not been slow to exploit changes in patterns of communication and now use strategies involving the Internet. It is common practice to use a “clickbait” strategy in which a simple “trick” or diet is advertised with the same sort of outrageous claims listed above. Moreover, rather than relying on inefficient print media, quacks send targeted messages through the Internet. For example, using the persuasive principle of liking and the knowledge that people tend to like people like themselves, they might claim that the method was discovered by someone living in the target’s hometown. I have been the target of messages alleging that Vancouver-based individuals have discovered tricks to beat the casinos or remove wrinkles.

Useful idiots
It appears that some quacks employ the “useful idiot” strategy developed by the Soviets, in which they cynically identify a well-intentioned person as a potential propagandist for their cause, someone who is in a position to influence popular beliefs but lacks the ability to fully evaluate the quack remedies they end up advocating. Goldacre18 has delineated the strategy of increasing brand awareness that is employed by the alternative health industry (AHI). In brief, the number of journalists is decreasing, and the amount of time that they have to investigate stories has also declined. Moreover, journalists typically are educated in the humanities and have little understanding of science. These conditions enable the AHI to use an effective strategy whereby unexamined claims are made for their products in newspaper columns that readers would


believe to be objective and well informed. In brief, the AHI supplies journalists with information in a form that can be easily transformed into a column on a borderline or pseudo-medical product. Unlike the labels on the product, which can be regulated, the opinions of the journalist on the product can be anything the sponsors want them to be. Goldacre’s detective work led him to state that “(a nutritional) Company memos described elaborate promotional schemes planting articles on their research in the media, deploying researchers to make claims on their behalf, using radio phone-ins and the like.” Moreover, it should be noted that this deceptive strategy is widely employed; Goldacre cites research conducted at Cardiff University demonstrating that some 80% of broadsheet news stories involved the use of secondhand material from news agencies or the public relations industry.

Rogue professors and rogue departments
Although traditionally scientific cranks and quacks were not associated with accredited universities, there is a small cadre of university appointees who hold and profit from views that are contrary to those of the majority of the members of their discipline. I call them rogue professors, whereas Goldacre designates them “Professor ?”. Such professors, like all professors, benefit from the tradition of academic freedom, namely, the freedom of inquiry that enables them to teach or communicate ideas or facts (including those that may not be accepted by other groups in society, including those in their discipline) without being repressed or suffering job loss. This principle is widely accepted but carries with it some unfortunate consequences. For example, the job security the university provides may result in some professors drifting into inactivity, while others may become cranks. It turns out to be very difficult to dislodge such crank professors. Another continuing problem in the academic world is the shortage of financial support, which leads some departments to promote an entrepreneurial spirit that encourages professors to find funding regardless of its source, particularly funding that carries with it money for the indirect support of the department.

Cherry-picking studies
In the face of adverse criticism, some members of the AHI and CAM practitioners have attempted to provide proper trials of their products or services. The NCCIH, for example, funds such studies, and the literature on CAM therapies is growing. But the reporting and interpreting

of such results tends to be selective. Goldacre18 has concluded that throughout the industry, the studies that are flawed (such as not incorporating appropriate controls or having low numbers of participants) tend to be the ones that favor homeopathy (or other alternative therapies), but the well-performed studies incorporating appropriate controls show that the therapies are no better than placebos. It is the flawed studies that are cherry-picked to be cited by proponents of alternative therapies. Moreover, when meta-analysis is employed it is found that homeopathy, for example, is no better than placebo. The corollary of cherry-picking studies to highlight the value of CAM therapies is that reports of negative studies will be rare. Goldacre18 reports that only 1% of all articles in CAM journals in 1995 gave a negative result, and that in the entire canon of Chinese medical research not one single negative trial had ever been published. In this respect, it is heartening that (as noted earlier) the NCCIH website does report some negative results. It must be admitted that some CAM strategies do not differ from those employed by rhetoricians in the mainstream biomedical disciplines in that various means are sought to enhance the perceptions of effect size (such as the judicious use of scales and ratios). So the skills to attack some quack strategies and tactics are also valuable in critiquing mainstream articles.

Quackery and dubious dental care
It is unlikely that you think your gluteus maximus influences your mandibular first premolar; however, a Journal of the American Dental Association article on dental quackery53 makes note of a quack dentist asserting just such a ludicrous relationship. Dental quackery is performed not only by outright quacks but also by dentists who stray from accepted procedures and attempt fringe treatments. In a report prepared for the American Council on Science and Health, Dodes and Barrett,54 leaders in the fight against quackery, state that experts on dental health fraud suspect that over $1 billion a year is spent on dubious dentistry. A brief summary of their report follows.

Dubious credentials
To be recognized as a specialist, dentists must undergo at least 2 years of advanced training at an institution accredited by organizations such as the American


Dental Association (ADA) Council on Dental Education. Recognized dental specialties include endodontics, oral and maxillofacial surgery, oral pathology, orthodontics, pediatric dentistry, periodontics, prosthodontics, and public health dentistry. Yet, some dentists claim specialty status based on nutrition degrees from correspondence schools, certificates from organizations with no scientific standing, continuing education courses, or alleged expertise in unrecognized fields—such as “holistic dentistry” or “amalgam detoxification.”

Controversial care
Controversial care refers to treatment modalities that have either been tried and found wanting—such as Sargenti root canal therapy—or to treatments that have not been tested sufficiently to ascertain their usefulness. Temporomandibular disorders (TMDs) have been described as dentistry’s hottest area of unorthodoxy and complete quackery. Mohl et al55 have reviewed some of the devices used for the diagnosis and treatment of TMDs and have concluded that there is no evidence to support the use of surface electromyography or silent-period duration for the diagnosis of TMDs and that sonography and Doppler ultrasound have no particular advantage over conventional stethoscopes or direct auscultation.

Other problem areas
Implants, bonding, and tooth bleaching can be useful treatments, but they are occasionally performed by inappropriately trained personnel. For example, in my view, an implant placed by a dentist with only one short training course (possibly as little as 1 day) has a worse prognosis than one placed by an oral surgeon or periodontist.

Blatant quackery
Dodes and Barrett give examples of practices that have no scientific basis whatsoever but that, nevertheless, are employed by quacks to treat dental problems:
• Reflexology. Involves pressing on hands or feet to relieve pain and to remove the underlying cause of disease in other parts of the body.
• Cranial osteopathy. Based on the view that “manipulation” of the skull’s bones can cause “the energy of life” to flow to cure or prevent a wide variety of health problems, including pain—especially pain associated

with TMDs. Quite apart from the pseudoscientific theoretical basis, cranial osteopathy is practically flawed because the bones of the skull are sutured together, and the manipulations performed by cranial osteopaths do not produce movements large enough to be detected even with sensitive instruments.
• Silver amalgam replacement. Has been advocated to avoid mercury toxicity; however, billions of amalgam fillings have been used successfully, and fewer than 50 cases of allergy have been reported in the scientific literature since 1905.53 The ADA reported that no credible evidence exists to show that mercury in dental amalgam, when used in nonallergic patients, constitutes a general health hazard or is related to any specific disease.
• Nutrition quackery and holistic dentistry. Occur under the guise of nutritional counseling, whereby unnecessary dietary supplements, including herbal remedies, are advocated and sold. Greene56 believes that many of these treatments are directed not at dental disease, but rather at the overall condition of patients. When applied by dentists, such treatments push dentistry into unorthodox medical care, which, as exemplified by herbal medicine, can endanger patients.57,58
• Applied kinesiology. Practiced mostly by chiropractors, it is a technique used to assess nutritional status on the basis of muscles’ response to mechanical stress. Operationally, the approach typically involves the practitioner pushing down on a patient’s outstretched arms before and after vitamins or other substances are placed under the patient’s tongue. Treatment can consist of expensive vitamin supplements or a special diet. A controlled test found that applied kinesiology was no more useful than random guessing in evaluating nutrient status.59
• Auriculotherapy. Acupuncture of the ear, based on the notion that stimulating various points on or just beneath the skin can balance the “life force” and enable the body to recover from disease. Some proponents claim that it is effective against dental pain.

Abuses of Logic

Some traditional fallacies of informal logic
Several traditional fallacies of informal logic may be considered arguments that are valid in that the


conclusion is logically related to the premises but that are unsound in the sense that at least one of the premises is wrong.60 It is important to note that the premises and the conclusions of these fallacies are not necessarily wrong. Other common fallacies occur because we do not focus directly on the points at issue but rather on circumstances that are more or less irrelevant. Some examples of traditional fallacies follow; more can be found in texts61–63 that cover the topic. An advantage of remembering these fallacies and the reasoning underlying their fallacious status is that the names provide a shorthand for critiquing arguments that can be deployed quickly in the heat of discussion. But be careful; Tindale63 notes that many of the fallacies are reasonable forms of argument and cannot be dismissed automatically without detailed examination of the case in question.

Ad hominem and the fallacy of origin
The fallacy of origin occurs when we reject an argument because it comes from an unreliable source. This fallacy can be considered in some ways the opposite of the argument from authority, which is based largely on the reliability of the source. The force of an argument, however, should rest not on its source, but on its premises and logic. For example, we may be tempted to dismiss studies funded by manufacturers, but to do so indiscriminately would be erroneous, because some manufacturer studies are sound and introduce valuable innovations into, for example, oral care. In such instances, the key is to examine the studies carefully and to identify their strengths and weaknesses. A variant on the fallacy of origin is ad hominem, or personal attack, whereby the person making the argument, rather than the argument itself, is criticized. The ad hominem fallacy occurs frequently in heated discussions. The tactic can be likened in some situations to a sledgehammer, but it can be quite a nuanced affair depending on circumstances, including the form in which it is posed (eg, as a question: “Have you stopped beating your wife?”). Walton62 lists seventeen critical questions for ad hominem arguments.

Special pleading
The fallacy of special pleading occurs when we refuse to apply to ourselves principles that we apply to others. For example, a scientist acting as a reviewer for a journal might refuse to allow a statement in a paper because the data underlying the statement were not

statistically significant. In his own paper, the scientist may not adopt the same stringent criteria and publish statements that are based on nonstatistically significant trends in the data. Any reason offered for this disparity of standards, such as difficulty in performing experiments or shortage of patients, might be considered special pleading.

Tu quoque (you also)
When under attack for lacking evidence to support their claims, quacks use the tu quoque argument—agreeing that there may not be evidence to support their claims but noting that much of established medicine or dentistry also lacks evidence. As with the cliché “two wrongs don’t make a right,” the lack of evidence for some established medical treatments in no way supports the use of quack therapies. Rather, it is an argument for more research. Typically, quacks cite out-of-date or cherry-picked studies, not representative of all available information, to support the tu quoque argument; the estimate that only 10% to 20% of medical treatments have an established scientific basis has been dismissed as no more than a modern medical myth.64 Goldacre18 estimated that the proportion of evidence-based activity in real-world hospitals varies with specialty but falls in the range of 50% to 80%. Clearly this observation suggests that more research is required to establish evidence-based protocols, but it in no way suggests that quack remedies and modern medical procedures have a similar evidentiary basis.

Fallacy of composition
The fallacy of composition occurs when we believe that what is true of all of the parts is true of the whole. This is sometimes untrue. For example, it is said (at least in a logic text) that while an individual baboon is cowardly, a group of baboons is extremely aggressive (but personally I would not act on this assumption by challenging an individual baboon).60 On the other hand, it is often true that the whole shares the characteristics of its parts. For example, individual cells are poisoned by cyanide, and an individual, composed of many cells, also is poisoned by cyanide. In fact, much of modern biology is based on the principle of reductionism—studying simpler aspects of an organism to get information about the whole. However, there is the danger of missing something because of interrelationships of the parts. An example can be found in the study of chemical carcinogenesis. It appears that some


compounds must be metabolized before they become carcinogenic. A once-accepted theory that correlated chemical structure of organic compounds with their carcinogenicity ignored this possibility and has since been discredited.

Fallacy of division
This questionable argument occurs when someone argues that each of the parts shares the property of the whole. For example, Harvard is widely considered to be an excellent academic institution, but that fact does not prove that every professor at Harvard is an excellent academic. Some quacks make use of the halo they may have acquired while putting in some time at an excellent institution, but mere time spent there does not confer the skills of its other practitioners or the prestige of the institution.

Ad populum
Occasionally, statements are justified by stating that “there is widespread agreement,” a justification based on the suppressed premise that most people are right. The legitimate use of this reasoning process occurs in inductive reasoning, where the people concerned are “reliable authorities.” Ad populum as a persuasive tool builds on the principle of social validation: the innate tendency of most people to value belonging to a group and, over time, to change their beliefs and behavior to conform to group norms. Thus quacks are quick to emphasize the popularity of their treatments. However, what one really wants is objective evidence that the treatment works. I once engaged in an argument with a friend over the performance of a prime minister; my opponent answered my points on the prime minister’s flawed performance with the statement “he’s still very popular” and cited a recent poll to prove it, but the issue wasn’t his popularity; it was his competence.

Genetic fallacy
A genetic fallacy occurs when it is assumed that one of several necessary conditions is the sole and exclusive cause of an effect. This fallacy has occurred in cancer research. As each new area of biochemistry develops, tumor cells are examined closely, and alterations (as compared with “normal” cells) are found. Then it is proposed that the alteration is the cause of the tumor. In most cases, a tumor cell without the alteration is eventually found; such an observation destroys the credibility of that particular alteration

as a necessary condition for cancer. Thus, 70 years ago, changes in oxidative enzymes were implicated; in the 1970s, there was extensive interest in altered membrane glycoproteins and/or glycolipids; in the early 1980s, researchers focused on the cytoskeleton; and in the late 1980s, scientists suspected oncogenes. Hopefully, some of these bandwagons will eventually prove useful in cancer treatment. One example would be breast cancer, where knowledge of genotype has been used to specify particular treatments.

Post hoc
Post hoc is taken from the Latin expression post hoc ergo propter hoc, which, literally translated, means “after this, therefore because of this.” In chapter 8, Hume’s49 three criteria for cause and effect are given; this fallacy treats just one of them (temporal priority) as sufficient. The post hoc fallacy has been of prime benefit to quacks and producers of patent medicines. Someone becomes ill, takes a medicine, and gets better; the individual attributes his or her well-being to the medicine, rather than to a healing process initiated by his or her own body. As someone said of vitamin C, “with it you can cure your cold in 7 days, whereas without it, it will take you a whole week to get better.” But the post hoc fallacy is not restricted to the laity; many clinical trials of drugs performed by clinicians are similarly affected. In some instances, an overt statement of the reasoning used is not made but only implied.

Apples and oranges
Often, people try to compare results obtained in one system with results expected in a different system using analogy. If the aspects in which the two systems differ are relevant conditions, then the person is said to be comparing apples and oranges. This criticism usually precipitates an argument about which conditions are relevant. The fallacy occurs frequently in the extrapolation of results from experimental animals to humans. Saccharin was nearly banned in the United States, partially on the basis of observations of Canadian scientists, who found that massive doses of saccharin produced tumors in rats. Rather than banning saccharin, a US senator is said to have suggested that a statement be placed on the package: “Warning: Canadians have determined that saccharin is detrimental to your rat’s health.” In the field of clinical pharmacology, for example, it is difficult to prove that a given chemical will have the same toxicity or efficacy


in humans as in experimental animals, but there is much evidence to suggest that, in many instances, this hypothesis is reasonable.

Circular reasoning
Circular reasoning occurs when an alleged proof of a statement eventually involves the assumption of the statement being proved. Circular reasoning can remain undetected for long periods. For example, one of the most often quoted “proofs” of Maxwell’s law of the distribution of molecular velocities assumes one of the features of the final law that it attempts to prove.65 If circular reasoning is suspected, the best place to look for it is in the definitions, particularly if the definition in question is odd. Circular reasoning is also called begging the question, since it begs (assumes) the very thing it tries to prove. Walton62 notes that an argument that begs the question is doomed from the start as a persuasive proof, for it does not advance the argument beyond the premises.

UFO fallacy
Unidentified flying objects are just that: unidentified. The question becomes whether they represent some form of visitation by extraterrestrial beings. Users of this fallacy generally argue their case by pointing out what individual sightings were not (it was not a weather balloon; none were released in that area). Chapter 6 defined the disjunctive syllogism as a valid form of deductive argument in which, given only two alternatives, when one is known to be false, the other must be true. Clearly, this technique, which could be called proof by elimination, can be effective whenever there is a finite number of possibilities. The problem with the UFO theory is that there is a potentially large number of causes for the phenomena, which may be grouped under the heading “unexplained natural occurrences.” Faced with such an unlimited list, the disjunctive syllogism becomes ineffective. The UFO fallacy occurs where proof by elimination is attempted without adequate consideration of all the possible alternatives.66 In proof by exclusion, clinical diagnosis can proceed by means of the disjunctive syllogism. For example, if, on routine dental examination, a middle-aged woman were found to have a radiolucent area near the apex of a mandibular anterior tooth, two possibilities might come to mind: (1) dental infection, ie, a nonvital tooth; or (2) periapical cemental dysplasia. The dentist would probably test the teeth for vitality. If the teeth were vital, the dentist could exclude the

first possibility—dental infection—and conclude that the most probable diagnosis was periapical dysplasia. However, to do so would run the risk of committing the UFO fallacy, for there are other possibilities—such as a midline cyst of the mandible or a central mucoepidermoid carcinoma—that also fit the facts of the case. According to Murphy,67 it is largely a myth that you can exclude some diseases as impossibilities and that, by default, the diagnosis must be something else. One difficulty is that there may be a problem with the sensitivity of the test; vitality testing, for example, does give false-positive results. But a far greater problem with proofs by exclusion is that, in practice, they rarely exclude, because it is difficult to prove that the list of possibilities is exhaustive.

Fallacies of clinical dentistry
Greene68 has noted several fallacies in the evaluation of clinical treatment procedures. Several of these arise from the nature of inductive logic, in particular, from failure to consider alternative explanations. Greene classifies these potential problems into the fallacies of success and failure.

Fallacies of success
The following factors have to be considered before reaching any conclusion about a treatment’s effectiveness:
• Spontaneous remission. Because some conditions are self-limiting, spontaneous remission is an alternative explanation of the supposed success of a clinical procedure. Individuals usually enter treatments when their symptoms are at their worst, and some symptoms naturally become less severe, regardless of whether therapy is instituted.
• Placebo response. When dealing with pathologic pain, about 35% (or perhaps a higher percentage) of the population responds favorably to a placebo, regardless of whether their problem is real or imagined, and this rate increases when a strong positive suggestion is added. The placebo effect is a plausible alternative hypothesis for many successes, and the rate of clinical success for any treatment must be compared with the placebo response for that particular problem.
• Multiple variables of treatment. Often, dentists devise treatments composed of several parts applied either together or in sequence. However, from Mill’s canons of induction, we know that you can only assess the effectiveness of each component by varying one


factor at a time. Thus, in a complex procedure, it is difficult to determine whether all parts of the treatment are essential or if some are meaningless. Irrelevant variables may be credited for the success, while the critical variable is overlooked. Greene68 cites the example of pontics in fixed partial dentures, where research attention focused on the compatibility of various materials (eg, gold, porcelain, and acrylic) with the edentulous ridge tissues. After many studies, researchers concluded that it was not the material, but the design of the pontic and its cleansability, that determined tissue response.
• Treatment of nonexistent problems. According to Greene,68 the most common treatment error is overtreatment. If there is no problem at the onset of treatment, and if the treatment does no harm (Hippocrates’ first rule of medicine), there will be no problem after treatment, and the procedure will be classified as a success. Greene believes that some extensive reconstruction procedures or certain prophylactic measures—such as preventive equilibration—are applied to patients who could be helped with a more conservative approach or who need no treatment at all.
• Short-term successes but long-term failures. Some procedures that seem successful at the time of treatment may fail later. For example, a successful pulpotomy done with calcium hydroxide in a permanent tooth may fail years later and result in dystrophic changes in the canals, which preclude root canal therapy. Some treatments may also produce side effects. This principle is recognized in pharmacology, where both efficacy and safety of drugs are evaluated prior to introduction of a drug to the market, but—as shown by the problems with intrauterine devices—the same kinds of rigorous standards are not always applied to nonpharmacologic treatment procedures or devices.

Fallacies of failure
The fallacies of failure are related to the problem of negative results. There are always a host of possible reasons why success was not seen, including the following:
• Wrong diagnosis. An incorrect diagnosis can lead to the use of an unsuitable treatment.
• Incorrect cause-effect correlations. If the cause of the clinical problem is unknown or is incorrectly identified, a clinician might expect the treatment to fail. Greene68 states that this type of error usually is made by clinicians who have a theoretical bias about certain problems, which leads them to analyze all patients with those problems in a narrow frame of reference. The clinician who attributes myofascial pain dysfunction problems solely to occlusal disharmony may fail to help myofascial pain dysfunction patients with stress problems, and perhaps may make their situation worse.
• Multifactorial problems. Problems can be caused by more than one factor. For example, if a patient has a necrotic pulp in a tooth that also has a periodontal problem, conventional periodontal therapy may fail without concomitant root canal therapy. Such a failure would not result from the periodontal therapy being an ineffective treatment, but rather from the complex nature of the problem.
• Methodologic problems. Normally effective treatments might fail because of defective methods, such as improper execution of the treatment, lack of patient cooperation, premature evaluation, and even the patient’s psychologic state. The latter phenomenon is exemplified in severely depressed patients, who simply do not respond well to conventional treatment of their physical problems.
• Insufficient statistical power. Treatments can fail to show a statistically significant effect because there are too few participants in the experiment to either establish or conclusively rule out the existence of effects of clinically meaningful magnitude.

References
1. Macaulay TB. The Miscellaneous Writings of Lord Macaulay, vol 2. London: Longman, Green, Longman, and Roberts, 1860.
2. Swales J. The troublesome search for evidence: Three cultures in need of integration. J R Soc Med 2000;93:553.
3. Rangachari PK. Evidence-based medicine: Old French wine with a new Canadian label? J R Soc Med 1997;90:280–284.
4. Kassirer JP. The quality of care and the quality of measuring it. N Engl J Med 1993;329:1263–1265.
5. Woolf CJ, Max MB. Mechanism-based pain diagnosis: Issues for analgesic drug development. Anesthesiology 2001;95:241–249.
6. Esposito M, Coulthard P, Thomsen P, Worthington HV. Enamel matrix derivative for periodontal tissue regeneration in treatment of intrabony defects: A Cochrane systematic review. J Dent Educ 2004;68:834–844.
7. Gottlieb S. Modernizing development science to unlock new treatments. Presented at the American Enterprise Institute, Washington, DC, 7 Feb 2006.
8. Redwood D. Chiropractic. In: Micozzi MS (ed). Fundamentals of Complementary and Alternative Medicine. New York: Churchill Livingstone, 1996:91–110.
9. Brunette DM. Alternative therapies: Abuses of scientific method and challenges to dental research. J Prosthet Dent 1998;80:605–614.


10. Linde K, Clausius N, Ramirez G, et al. Are the clinical effects of homoeopathy placebo effects? A meta-analysis of placebo-controlled trials. Lancet 1997;350:834–843.
11. Shang A, Huwiler-Müntener K, Nartey L, et al. Are the clinical effects of homoeopathy placebo effects? Comparative study of placebo-controlled trials of homoeopathy and allopathy. Lancet 2005;366:726–732.
12. The end of homeopathy [editorial]. Lancet 2005;366:690.
13. Kuhn TS. Normal science as puzzle-solving. In: Kuhn TS (ed). The Structure of Scientific Revolutions, ed 2. Chicago: University of Chicago, 1970:35–42.
14. Haack S. Manifesto of a Passionate Moderate: Unfashionable Essays. Chicago: University of Chicago, 1998.
15. Wootton D. Bad Medicine: Doctors Doing Harm Since Hippocrates. Oxford: Oxford University, 2007.
16. Gardner M. Fads and Fallacies in the Name of Science. New York: Dover, 1957.
17. Gardner M. Science: Good, Bad, and Bogus. Buffalo: Prometheus, 1981.
18. Goldacre B. Bad Science. London: Fourth Estate, 2009.
19. Bausell RB. Snake Oil Science: The Truth about Complementary and Alternative Medicine. Oxford: Oxford University, 2007.
20. Medawar PB. The Limits of Science. Oxford: Oxford University, 1984:18.
21. Yiamouyiannis J. Lifesaver’s Guide to Fluoridation. Delaware: Safe Water Foundation, 1983.
22. Wulf CA, Hughes KF, Smith KG, Easley NW. Abuse of the Scientific Literature in an Antifluoridation Pamphlet, ed 2. Columbus: American Oral Health Institute, 1988.
23. Grace M. Facts on fluoridation. Br Dent J 2000;189:405.
24. Jarvis W. Dangerous fleece. The Scientist. 4 Feb 1989:Letters.
25. Wetli CV, Davis JH. Fatal hyperkalemia from accidental overdose of potassium chloride. JAMA 1978;240:1339.
26. Oseas RS, Phelps DL, Kaplan SA. Near fatal hyperkalemia from a dangerous treatment for colic. Pediatrics 1982;69:117.
27. Zwicky JF, Hafner AW, Barrett S, Jarvis WT. Readers Guide to Alternative Methods. Milwaukee: American Medical Association, 1993.
28. Skrabanek P, McCormick J. Follies and Fallacies in Medicine. Buffalo: Prometheus, 1990.
29. National Center for Complementary and Integrative Health, National Institutes of Health. NCCIH Facts-at-a-Glance and Mission. https://nccih.nih.gov/about/ataglance. Accessed 2 Aug 2018.
30. Eisenberg DM, Kessler RC, Foster C, Norlock FE, Calkins DR, Delbanco TL. Unconventional medicine in the United States. Prevalence, costs, and patterns of use. N Engl J Med 1993;328:246.
31. National Center for Complementary and Alternative Medicine. A new portrait of CAM use in the United States. NCCAM Newsletter 2004;11(343).
32. National Center for Complementary and Alternative Medicine, National Institutes of Health. 2006 Congressional Justification. https://nccih.nih.gov/sites/nccam.nih.gov/files/about/budget/congressional/2006.pdf. Accessed 2 Aug 2018.
33. Marcus DM, Grollman AP. Science and government. Review for NCCAM is overdue. Science 2006;313:301–302.
34. Offit PA. Do You Believe in Magic? The Sense and Nonsense of Alternative Medicine. New York: Harper Collins, 2013:97.
35. Hempel CG. The test of a hypothesis: Its logic and its force. In: Hempel CG (ed). Philosophy of Natural Science. Englewood Cliffs, NJ: Prentice Hall, 1966:19–46.

36. Crelin ES. A scientific test of chiropractic theory. Am Sci 1973;61:574–580.
37. Skrabanek P. Acupuncture: Past, present, and future. In: Stalker D, Glymour C (eds). Examining Holistic Medicine. Amherst: Prometheus, 1984:181–196.
38. Jacobs J, Moskowitz R. Homeopathy. In: Micozzi MS (ed). Fundamentals of Complementary and Alternative Medicine. New York: Churchill Livingstone, 1996:67–78.
39. Ziman J. Reliable Knowledge: An Exploration of the Grounds for Belief in Science. Cambridge: Cambridge University, 1978.
40. Guruswamy K. Wishing on a fish; thousands of Indian asthma sufferers line up for herb-sardine cure. Associated Press. 9 Jun 1997.
41. Beecher HK. Pain, placebos and physicians. Practitioner 1962;189:141.
42. Roberts AH, Kewman DG, Mercier L, Hovell M. The power of nonspecific effects in healing: Implications for psychosocial and biological treatments. Clin Psychol Rev 1993;13:375.
43. Benedetti F, Mayberg HS, Wager TD, Stohler CS, Zubieta JK. Neurobiological mechanisms of the placebo effect. J Neurosci 2005;25:10390–10402.
44. Beck FM. Placebos in dentistry: Their profound potential effects. J Am Dent Assoc 1977;95:1122.
45. Kaptchuk TJ, Stason WB, Davis RB, et al. Sham device v inert pill: Randomised controlled trial of two placebo treatments. BMJ 2006;332:391–397.
46. Welch JS. Ritual in Western medicine and its role in placebo healing. J Relig Health 2003;42:21–31.
47. Gryll SL, Katahn M. Situational factors contributing to the placebo’s effect. Psychopharmacology (Berl) 1978;57:253–261.
48. Gracely RH, Dubner R, Deeter WR, Wolskee PJ. Clinicians’ expectations influence placebo analgesia. Lancet 1985;1:43.
49. Hume D. A Treatise of Human Nature (1739–1740), ed 2, vol 1. Oxford: Clarendon, 1978:173–175.
50. Needleman I, Tucker R, Giedrys-Leeper E, Worthington H. Guided tissue regeneration for periodontal intrabony defects—A Cochrane systematic review. Periodontol 2000 2005;37:106–123.
51. Freireich EJ. Unproven remedies: Lessons for improving techniques of evaluating therapeutic efficacy. In: MD Anderson Hospital and Tumor Institute (ed). Cancer Chemotherapy: Fundamental Concepts and Recent Advances. Chicago: Year Book Medical, 1975:385–401.
52. Cornacchia HJ, Barrett S. Consumer Health: A Guide to Intelligent Decisions. St Louis: Mosby, 1989:44–45.
53. Berry JH. Questionable care: What can be done about dental quackery? J Am Dent Assoc 1987;115:681.
54. Dodes JE, Barrett S. Dubious Dental Care. New York: American Council on Science and Health, 1991.
55. Mohl ND, Lund JP, Widmer CG, McCall WD Jr. Devices for the diagnosis and treatment of temporomandibular disorders. Part II: Electromyography and sonography. J Prosthet Dent 1990;63:334.
56. Greene CS. Holistic dentistry: Where does the holistic end and the quackery begin? J Am Dent Assoc 1981;102:25.
57. Tyler VE. Hazards of herbal medicine. In: Stalker D, Glymour C (eds). Examining Holistic Medicine. Buffalo: Prometheus, 1985:323.
58. Interlandi J. Supplements can make you sick. In: Supplements: A Complete Guide to Safety. Consumer Reports Sept 2016:20–33.


59. Kenney JJ, Clemens R, Forsythe KD. Applied kinesiology unreliable for assessing nutrient status. J Am Diet Assoc 1988;88:698.
60. Capaldi N. The Art of Deception. New York: Brown, 1971:102–103.
61. Fearnside WW, Holther WB. Fallacy: The Counterfeit of Argument. Englewood Cliffs, NJ: Prentice Hall, 1959.
62. Walton D. Informal Logic: A Pragmatic Approach, ed 2. Cambridge: Cambridge University, 2008.
63. Tindale CW. Fallacies and Argument Appraisal. Cambridge: Cambridge University, 2007.
64. National Council Against Health Fraud (NCAHF). The making of a modern medical myth: “Only 10–20% of medical procedures are proved.” NCAHF Newsletter 1995;18(6).
65. Wilson EB. An Introduction to Scientific Research. New York: McGraw Hill, 1952:34.
66. Giere RN. Understanding Scientific Reasoning. New York: Holt, Rinehart and Winston, 1979:149–155.
67. Murphy EA. A Companion to Medical Statistics. Baltimore: Johns Hopkins University, 1985:25.
68. Greene CS. The fallacies of clinical success in dentistry. J Oral Med 1976;31:52.


10 Elements of Probability and Statistics, Part 1: Discrete Variables



“Probability does pervade the universe—and in this sense the old chestnut about baseball imitating life really has validity. The statistics of streaks and slumps, properly understood, do teach an important lesson about epistemology and life in general. The history of a species, or any natural phenomenon that requires unbroken continuity in a world of trouble, works like a batting streak. All are games of a gambler playing with a limited stake against a house with infinite resources. The gambler must eventually go bust.”
— Stephen Jay Gould1

Probability and Distributions

Statistics is the scientific methodology used to make probability statements; it is generally divided into two types: (1) descriptive statistics involves gathering, displaying, and summarizing data, and (2) inferential statistics is the science of drawing conclusions from specific data. These conclusions can be helpful in making decisions such as how to bet in card games or on elections. Statisticians first determine how the data are distributed—that is, what and how many data points have particular outcomes or scores—and then use that information to make statements about probability of specified occurrences. For example, in interpreting the results of a clinical trial for any treatment, the question is whether the treatment had any effect or whether differences between treated and control groups happened by chance. Certain distributions of data, such as the normal distribution and the binomial distribution, feature prominently in scientific research and can be described mathematically. Knowing that data follow such distributions enables statisticians to make statements about particular situations, such as the probable value of the average of a large population based on a single sample. Statistical statements have an element of quantified uncertainty. For example, most readers are familiar with election opinion polls, of which a typical result might state that a candidate is favored by 51% of the voters with a margin of error of 4%. Statistics is grounded in probability theory; this chapter starts with the laws of probability, shows how these yield particular distributions of data, and examines how statisticians make inferences from data and distributions. Typically, the inferences of scientific interest include how well the results conform to a theory, whether there are significant differences between groups (such as a treated group and a control), and estimation of values and differences.


Probability theory was conceived to satisfy gamblers’ desire to calculate the odds of outcomes of various events, such as rolling dice or being dealt certain cards. The 17th-century mathematicians Blaise Pascal and Pierre de Fermat are widely regarded as the founders of a general theory of probability. Pascal’s curiosity about probability was prompted by a gambling friend, the Chevalier de Méré, who had discovered through personal experience that a seemingly well-established gambling rule led to unsatisfactory results and sought Pascal’s help. Pascal ultimately carried his interest in probability into the realm of the “Four Last Things”—death, judgment, heaven, and hell—and concluded that to live one’s life as a devoted Christian was a good wager (known as Pascal’s Wager), reasoning that the small bet entailed in the costs of living a Christian life paled in comparison to the infinite rewards of heaven.2 My approach here, however, will be limited to more mundane issues.
Probability3 is a term that has different meanings depending on the context in which it is used. Three approaches follow:
1. Classical probability, or the a priori approach, was developed for making intelligent wagers; it applies the fundamental assumption (rather than empirical evidence) that the game is fair (for example, that a fair die is equally likely to show 1, 2, 3, 4, 5, or 6).
2. Frequentist probability—the prevailing approach used in science—was formally defined by the mathematician John Venn (of the diagrams), who stated in 1872 that the probability of a given event is the proportion of times the event occurs over the long run.
3. Subjective or personal probability judgments (as in “I think it’s 50:50 that the Cardinals will win the World Series this year”) are often the only tools available for complex predictions about the future, because obtaining the type of data needed for a frequentist approach is impossible, and a strong theoretical base for making any prediction is lacking. Improbably perhaps, this approach forms the first step in Bayesian statistics, which has found wide application in dentistry and medicine (see chapters 15 and 21).

Probability calculations
To make probability calculations, several rules must be observed:

• Probabilities always take a value between 0 (an impossible event) and 1 (an inevitable or totally certain event). An event is defined as a set of elementary outcomes. In turn, elementary outcomes are the possible results of a random experiment, which, for probability purposes, is defined as the process of observing the outcome of a chance event. The sampling space is the set of all elementary outcomes. In probability problems, one often tries to determine the probability of combined events, such as: event A occurs and event B occurs; either event A or event B occurs; or neither event A nor event B occurs. In tackling probability problems, one normally divides the number of outcomes of interest by the total number of outcomes, ie, the sampling space. For example, in the roll of two fair dice, the sample space comprises 36 elementary outcomes (each of the six faces of one die can be combined with each face of the second die), of which all are equally likely. The event that the total of the two dice equals 2 can occur in only one way—when both dice show 1—so its probability is 1/36. In contrast, the event that the two dice total 4 can occur in three ways:

Die 1   Die 2
1    +    3
3    +    1
2    +    2

Therefore, its probability is 3/36.
• The addition rule for mutually exclusive events. If events are mutually exclusive, they cannot occur together. The probability of one or the other occurring is the sum of their individual probabilities. Thus, the probability of throwing either a 3 or a 1 on the single roll of a die is:

P(1 or 3) = P(1) + P(3) = 1/6 + 1/6 = 1/3

This outcome can be illustrated in a Venn diagram (Fig 10-1), where probability is indicated by area. The probability of the total sample space area is 1 (ie, it is certain that one face of the die will turn up—the die will not balance on an edge). The area for each outcome, represented by a circle, is 1/6, and the sum of the areas within the circles is 1/3 (1/6 + 1/6). Thus, the addition rule is applied to determine the likelihood that any one of several mutually exclusive events will happen.

[Fig 10-1 Diagram illustrating the probability of throwing a 1 or a 3 (ie, two mutually exclusive events). Each event is a circle of area P = 1/6; the area outside the circles = 4/6.]

• The general addition rule for nondisjoint outcomes. Nondisjoint outcomes are outcomes that have elements in common. Suppose a dental school employs 200 faculty. Of these, 30 hold dental degrees (P[DDS] = 30/200 = .15) and 10 hold PhD degrees (P[PhD] = 10/200 = .05). Upon closer examination it is found that 5 people hold both a dental degree and a PhD degree. To estimate the probability that a randomly selected employee has a PhD or a dental degree, therefore, it would be necessary to avoid counting twice those individuals who hold both degrees (Fig 10-2). That is, it would be inaccurate to count the total number of PhD degrees (10) and the total number of dental degrees (30) and arrive at a value of 40/200, yielding P = .2, because the 5 individuals holding both degrees would be counted twice. The general addition rule for nondisjoint outcomes adjusts the probability by subtracting those who have both degrees, as follows:

P(PhD or DDS) = P(PhD) + P(DDS) – P(DDS and PhD) = (10/200) + (30/200) – (5/200) = 35/200 = .175

[Fig 10-2 Example of nondisjoint outcomes (5 with PhD only, 5 with both PhD and DDS, 25 with DDS only). The probability of a PhD or DDS degree must be adjusted by subtracting those faculty who have both.]

• The multiplication rule for a series of independent events. Independent events are events where the occurrence (or not) of one is completely separate from the occurrence (or not) of the other. Using a fair die, the probability of each face turning up on a second roll is not influenced by the number that turned up on the first roll. The multiplication rule states that the probability that two independent events will occur is the product of their individual probabilities. Thus, the probability of throwing a 1 on the first roll of the die and a 3 on the second roll of the die is:

P(1) × P(3) = 1/6 × 1/6 = 1/36

• Conditional probability. Conditional probability refers to the probability that one event will occur if some other event has already occurred; it is generally expressed as P(X|Y) (the probability of event X given that event Y has already occurred). If two dice are thrown sequentially and the result of the first throw is a 1, then the probability that a 1 and a 3 will be thrown equals the probability that a 3 will be thrown on the roll of the second die. The sample space would be 36 if we did not know that a 1 had already been thrown. However, only 6 outcomes are now possible, representing the new sample space, and therefore the probability is 1/6 (P[3|1] = 1/6). Conditional probabilities are particularly important in the interpretation of diagnostic tests, which are discussed in chapter 15.
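As a numerical check, here is a minimal Python sketch (an illustration, not part of the text) that reproduces the values above using exact fractions:

```python
from fractions import Fraction

# Addition rule for mutually exclusive events: P(1 or 3) on one roll.
p1, p3 = Fraction(1, 6), Fraction(1, 6)
print(p1 + p3)  # 1/3

# General addition rule for nondisjoint outcomes, using the chapter's
# 200-faculty example: 30 DDS, 10 PhD, 5 holding both degrees.
n, dds, phd, both = 200, 30, 10, 5
print(Fraction(dds, n) + Fraction(phd, n) - Fraction(both, n))  # 7/40 = .175

# Multiplication rule for independent events: a 1 on the first roll
# and a 3 on the second roll.
print(p1 * p3)  # 1/36

# Conditional probability by restricting the sample space:
# P(3 on second die | first die shows 1) = 1/6.
sample_space = [(i, j) for i in range(1, 7) for j in range(1, 7)]
given_first_is_1 = [(i, j) for (i, j) in sample_space if i == 1]
favorable = [(i, j) for (i, j) in given_first_is_1 if j == 3]
print(Fraction(len(favorable), len(given_first_is_1)))  # 1/6
```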

Distributions of outcomes
A random variable is defined as the numeric outcome of a random experiment. Consider the random variable X = sum of the dots on two dice. As noted earlier, there are 36 possible outcomes, but because some values of X can be obtained in more than one way, the total number of values is only 11. The value 4, for example, can be obtained by a 3 on the first die and a 1 on the second; by a 1 on the first die and a 3 on the second; and by both dice showing a 2. Similar calculations could be done for all of the possible outcomes, yielding a probability distribution for X. To distinguish the random variable X from the particular values of the total on the dice, we will label the particular values x. Thus, a table can be constructed (Table 10-1). Two aspects of this table should be noted. First, the total sum of all probabilities equals 1. This makes sense because when two dice are thrown, only the values of x shown in the table can be obtained, and x must have some value since no die will land and balance on its edge. Second, X may be called a discrete variable in that it is made up of a finite number of values. Note that one cannot get a sum of the dice equal to 3.44; only integers can be obtained by summing the number of dots. This probability distribution can be drawn as a relative frequency histogram (Fig 10-3), with the ordinate given by the probability and—by convention—with the bars of the relative frequency centered on the outcome values. Also by convention, the total area of the histogram is assigned a value of 1. By following these conventions, we can compare the relative probabilities for specified outcomes simply by comparing the areas of the bars representing those outcomes. For example, we can see that the chance of two throws totaling 6 equals the chance of two throws totaling 8. The chance of the two dice yielding a value of 2 equals the chance of them totaling 12, and so on.

Table 10-1 | Probability distribution for the random variable X = (no. of dots on first die) + (no. of dots on second die)*

Value of x                2     3     4     5     6     7     8     9     10    11    12
Probability that X = x    1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

*x = particular values of X, which can vary from 2 to 12.

[Fig 10-3 Relative frequency histogram of the sum of the dots on two dice (ordinate: P[X = x]; abscissa: value of x, from 2 to 12). The most probable value is 7.]
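The entries in Table 10-1 can be generated by brute-force enumeration of the 36 equally likely outcomes; a short illustrative Python sketch (not from the text):

```python
from collections import Counter

# Tally X = (dots on first die) + (dots on second die) over all 36
# equally likely outcomes; the counts reproduce Table 10-1.
counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))
for x in sorted(counts):
    print(f"P(X = {x}) = {counts[x]}/36")

assert sum(counts.values()) == 36  # the probabilities sum to 1, as noted
```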

Statistical Inference

Frequency distributions can help us make decisions. In light of the frequency distribution in Fig 10-3, for example, it would be reasonable to accept an even money bet that the total number of dots on two dice would total 7 more times than they would total 6. If the dice followed the classic probabilities, we could expect that in 36 throws of the dice, the total 6 would show up five times whereas the total 7 would show up six times. Over several throws of the dice, the person betting on 7 would win more times than the person betting on 6.
The preceding example is based on classical probability, but frequency distributions also can be generated by actual events or experiments. To demonstrate, we could throw pairs of dice again and again, each time recording the sum of the dots. Intuitively, one would expect the empirical distribution to more closely match the classical (a priori) distribution as the number of throws increased. Figure 10-4 shows the results of a computer simulation in which two dice were thrown 10, 100, 500, and 5,000 times, respectively. By recording the results in the form of a frequency histogram—that is, the number of times each of the possible totals occurred—we can see that the experiment representing 5,000 throws yielded results closer to the classical (a priori) distribution than did the lower number of throws.
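A simulation of this kind takes only a few lines of code. The following Python sketch is an illustrative stand-in for whatever program generated Fig 10-4 (the seed and formatting are my arbitrary choices); it prints the empirical frequencies for an increasing number of throws:

```python
import random
from collections import Counter

# Empirical frequency (%) of each total of two dice for increasing
# numbers of throws; larger samples should track the a priori
# distribution (eg, a total of 7 should approach 6/36 = 16.7%).
random.seed(1)  # arbitrary seed, for reproducibility
for n_throws in (10, 100, 500, 5000):
    totals = Counter(random.randint(1, 6) + random.randint(1, 6)
                     for _ in range(n_throws))
    freqs = {x: round(100 * totals[x] / n_throws, 1) for x in range(2, 13)}
    print(n_throws, freqs)
```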

An example of statistical inference: Hypothesis testing

In the preceding sections on logic, we assumed that test predictions of the hypothesis could readily be found true or false. In fact, this often is not the case; rather, statistical testing of a hypothesis requires a standard ritual, as outlined by Larkin3:

1. Propose a research question.
2. State an alternative hypothesis H1 based on the research hypothesis.
3. State a null hypothesis H0.
4. Choose a level of significance.
5. Sample.
6. Choose a probability distribution appropriate to the problem.
7. Find the distribution of possible values if H0 is true.
8. Find the probability of getting the sample value or a value more extreme.
9. Decide whether to accept or reject H0.

Let us consider a mock example. Suppose there are four University of British Columbia (UBC) dental graduates—A, B, C, and D—who plan to hold a reunion after each has completed 20 endodontic cases. At this reunion, they will determine how their patients have fared as compared with patients treated by other dentists in general practice.



Fig 10-4 Observed frequency distributions of the total number of dots on two dice when thrown (a) 10, (b) 100, (c) 500, and (d) 5,000 times (x-axis: sum of two dice; y-axis: frequency [%]). As more throws are made, the observed distribution more closely resembles the expected (a priori) distribution.

Step 1: Research question

Their research question is: Does a recent UBC graduate differ from other practicing dentists in the proportion of successes in endodontic therapy? Although reports in the dental literature claim a 95% success rate for endodontic therapy, a more realistic figure for dentists in general practice would be around 80% (Dow P, personal communication, 1982).

Step 2: Alternative hypothesis

The alternative hypothesis is PUBC ≠ POthers, where PUBC is the proportion of endodontic successes for each recent UBC graduate and POthers is the proportion of successes for other dentists, which is estimated as 0.80.

Step 3: Null hypothesis

Testing the null hypothesis is analogous to the mathematics strategy of proof by contradiction: We begin testing the null hypothesis by assuming it is true. In this example, then, the UBC graduates and other dentists are both random samples from the same target population. The null hypothesis is PUBC = POthers or, expressed differently for a constant-sized sample, number of successes of UBC graduates = number of successes of other dentists. The question to ask is: Are the results obtained by a UBC graduate better or worse, or different in any way, from those obtained by others? To answer it, we will calculate the consequences of the null hypothesis—that is, the probability of the observed results if the null hypothesis were true. If this probability is low, we will reject the null hypothesis because it contradicts the observed evidence.



Step 4: Significance level

The dental students agree to accept a difference as being real if the possibility that their results could be explained by chance is less than 5%. This is an arbitrary decision, but it conforms to the commonly used value of 5%. By this standard, we reject the null hypothesis if the observed result occurs by chance less than 5% of the time (ie, 1 time out of 20). But there are really no formal rules for setting this standard, which is called the level of significance and designated α. The commonly used level of significance of 5% was described by Fisher4 as:

Usual and convenient for experimenters . . . in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results.

Fisher pointed out that there is always a risk of error and asserted that 5 mistakes in 100 repetitions constitutes a reasonable rate of risk. It is important to note that, relative to the decisions made in everyday life, a 5% risk of error is conservative. Brides and grooms promise to stay together "'til death do us part," but many more than 5% of marriages end in divorce, not death. Racetracks, lotteries, and stock markets would go out of business if everyone insisted on a 95% chance of their investments increasing in value. The standard of a 1/20 risk of accepting error is intended to prevent any branch of science from collapsing like a house of cards built on poorly supported observations. (However, such a conservative standard probably results in some interesting and useful studies languishing in file drawers.)

Step 5: Sample

The four dentists each treat a sample of 20 patients. At the reunion, the dentists report their data (Table 10-2). The question is, can these data be explained by chance? To find the answer, we must compare the observed set of facts with some expected values. The various techniques to do so constitute a subject in themselves and are taught in probability and statistics courses. In this section we are primarily concerned with the logic behind statistical tests rather than the computational details.

Table 10-2 | Successes and failures of four dentists

Dentist | Successes | Failures
A | 20 | 0
B | 16 | 4
C | 13 | 7
D | 11 | 9

In all subsequent calculations, we will assume that the sample is obtained randomly and is therefore representative of the population. This assumption is common to all statistical inference, though in many instances random sampling is not possible.

Steps 6 and 7: A probability distribution

One possibility is that the observed data result from chance. The goal is to determine, from probability considerations alone, the frequency of success in samples of 20 when the predicted rate of success is 80%.

Step 6: Generating a distribution

The particular distribution of interest for this example is called the binomial distribution, which results when there are only two outcomes—in this case, success or failure, with fixed probabilities of .80 and .20, respectively. One approach that could be used to calculate the frequencies of success and failure would be to rent a machine that mixes balls and selects them randomly, like those used in a bingo hall. Using white balls to represent success and black balls to represent failure, we would add them to the machine in a ratio of 80:20. This ratio is chosen because the proportion of 80% represents the success rate generally found for endodontic therapy—ie, POthers = .80. To collect the data, we would let the machine select a sample population of 20 balls representing the number of patients treated by each of the four dentists. The number of black and white balls would be tabulated, then thrown back into the machine, and another sample of 20 balls would be randomly selected, tabulated, and so on. These collected data are the reference distribution. When 1,000 random samples had been collected in this manner, a table of the results could be prepared (Table 10-3). Using this table, we could plot the reference distribution as a frequency distribution, known as a relative frequency histogram (Fig 10-5).
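In place of the bingo-ball machine, a reference distribution of this kind can be generated by computer. A hedged sketch in Python (illustrative only; the original data were generated differently):

import random
from collections import Counter

random.seed(2)  # illustrative seed for reproducibility
# Each "sample" is 20 patients, each a success with probability .80,
# mimicking draws from a population of 80% white balls.
samples = Counter(
    sum(random.random() < 0.80 for _ in range(20)) for _ in range(1000)
)
for successes in sorted(samples):
    print(f"{successes:2d} successes: {samples[successes] / 10:.1f}%")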



Table 10-3 | Results of random selection of a population containing 80% white ping-pong balls and 20% black ping-pong balls (sample of 20; white = success)

No. of white balls (successes) | Frequency (%)
0 | 0
1 | 0
2 | 0
3 | 0
4 | 0
5 | 0
6 | 0
7 | 0
8 | 0
9 | 0
10 | 0.2
11 | 0.7
12 | 2.2
13 | 5.5
14 | 10.9
15 | 17.6
16 | 21.8
17 | 20.6
18 | 13.6
19 | 5.9
20 | 1.0


Fig 10-5 Relative frequency histogram of the number of white balls in samples of 20 drawn from a population containing 80% white balls and 20% black balls.

Step 7: Frequency distributions

In the preceding example, we generated the frequency distribution using a bingo hall apparatus to mix and select black and white balls randomly. In generating this distribution, we adopted the frequentist approach to probability, in which the probability of a given event is the proportion of times the event occurs over the long run (here, 1,000 random samples). We did not need to resort to a mechanical model; a mathematical formula could have been used to calculate the frequencies (and, in fact, was used to generate the data above). In brief, the probability of success is the same for every trial, and the probability of obtaining S successes in n trials is given by:

P(S) = \binom{n}{S} p^S (1 - p)^{n-S}

The binomial coefficient is derived by the formula:

\binom{n}{S} = \frac{n!}{S!(n - S)!}

Some readers might remember that the formula for the binomial coefficient is the same one used in determining the number of combinations of n objects taken S at a time. Other readers might recall that another way of finding the binomial coefficient is to use Pascal's triangle, in which each entry is the sum of the two numbers just above it:

1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
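For readers who want to compute the distribution directly, a minimal Python rendering of the binomial formula (the function name is ours, not from the text):

from math import comb

def binom_pmf(s: int, n: int, p: float) -> float:
    """Probability of s successes in n trials with success probability p."""
    return comb(n, s) * p**s * (1 - p)**(n - s)

# The modal value of the reference distribution: 16 successes out of 20 at p = .80.
print(f"{binom_pmf(16, 20, 0.80):.3f}")  # 0.218, ie, the 21.8% of Table 10-3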

No matter how it was generated, the graph displays two points to consider:

1. The peak of the frequency distribution—16 successes, or 4 failures, out of 20—is exactly the result expected because the parent population contained 80% white balls. This peak, called the mode, is considered the most representative value. However, only 21.8% of the samples had this modal (ie, most commonly occurring) value that is characteristic of the parent population. Thus, a sample value is more likely to differ from the true population value than to represent it exactly, and this is true of most, but not all, sampling distributions.

2. None of the samples drawn had 0 to 5 successes. This is not surprising; any dentist would intuitively know that if he or she had only 5 successes out of 20 attempts, the results could not be explained by chance. The logic behind this intuition will now be explored.



Table 10-4 | Results relative to frequency distribution

Dentist | Successes | Failures | % of time this no. of successes expected by chance | % of time this no. of successes or more extreme values expected by chance*
A | 20 | 0 | 1.1 | 1.1
B | 16 | 4 | 20.5 | 100
C | 13 | 7 | 5.5 | 8.6
D | 11 | 9 | 0.9 | 1.0

*Includes both tails of the failure distribution.

Steps 8 and 9: Calculation and judgment

Typically, investigators will report the results of statistical testing as a value of P, which is the probability of the result occurring by chance. If the investigator adopts the 5% level of significance, the statement "P < .05" means that the results are statistically significant—that is, they are unlikely to be explained by chance. We know from the experiment with the bingo balls that, because of fluctuations of random sampling, it is unlikely that the difference between the value actually observed and the value expected from the frequency distribution will exactly equal zero. Thus, we assume that any variation from the expected value is due to chance until demonstrated otherwise. This assumption is reasonable: Because we know that random fluctuations always occur, there is no need to postulate any other effect if random fluctuations alone can explain the difference between the observed and the expected values. This reasoning illustrates the philosophical principle called Occam's razor: One should not multiply causes without reason. As noted previously, the proposition that (observed minus expected) = 0 is called the null hypothesis (often symbolized as H0). Expressed another way, the null hypothesis is H0: P = .80, where P is the probability of success in the population from which we are sampling. The observed fraction of successes (such as 13 successes out of 20) is tested for its consistency with a true underlying probability of .80. The null hypothesis is always tested for rejection. If the difference (observed minus expected) is too large to be explained by chance, there must be some other effect operant; the hypothesis that another effect exists is designated H1. Thus, the decision of whether the data are consistent with the null hypothesis can be expressed as follows: (observed minus expected) = 0, ie, H0 true, only random fluctuation operant; (observed minus expected) ≠ 0, ie, H1 true, some other effect operant.

The logic is similar in form to the disjunctive syllogism:

Premise 1: Either H1 or H0 (where H1, the alternative hypothesis, states that something happened and the results cannot be explained by chance; and H0, the null hypothesis, states that the results can be explained by chance).
Premise 2: Not H0 (the null hypothesis is rejected).
Conclusion: Therefore H1 (the alternative hypothesis is true).

This form of argument is not a deductive syllogism in the strict sense because the statement "not H0" is not absolute but rather probabilistic. The net result of this argument is that if we can reject the null hypothesis, we can accept the alternative hypothesis. The problem now is, how do we decide that the null hypothesis is false? For illustration purposes, let us suppose that the results from the four UBC dentists relative to their frequency on the frequency distribution are given in Table 10-4. Recall that the null hypothesis in this case is that there is no difference between the proportion of successes found in their patients and the proportion of successes for dentists generally. Therefore:

(Observed no. of successes of recent UBC graduates) – (Expected no. of successes based on the experience of all dentists) = 0

We can see that the results of Dentist A, who had 20 successes, are hard to explain by chance because 20 successes in 20 trials would be expected to occur only 1.1% of the time if the sample were drawn from a population with an 80% success ratio. We should consider that there are two ways in which the UBC graduates could differ from other dentists:



Fig 10-6 (a) Relative frequency histogram for Dentist A. (b) Relative frequency histogram for Dentist C. (x-axis: no. of successes; y-axis: frequency of outcome [%].)

They could be better than average or worse than average. In considering performance relative to our reference probability distribution, before collecting data we would consider a result to differ from the reference distribution in two possible ways:

1. More successes than expected. That is, on the basis of the reference distribution, the number of observed successes or a greater number of successes would be predicted to occur 2.5% of the time or less.
2. More failures than expected. That is, on the basis of the reference distribution, the number of observed successes or a smaller number of successes would be predicted to occur 2.5% of the time or less.

In this way, the reference distribution is divided so that the left and right tails are both used, and the dividing line is placed 2.5% from each end. In our example, the crude reference distribution does not allow us to place cutoff points at exactly 2.5%, but we can still calculate the expected frequencies. Therefore, we can conclude that Dentist A's high success rate is unlikely to be due to chance because P = .011, which is less than .025. Such a high success rate would be expected only 1.1% of the time if the null hypothesis were true (Fig 10-6a); Dentist A's result falls outside the 2.5% cutoff line. On the other hand, we can conclude that there is no evidence to suggest that Dentist B is any different—better or worse—than the average dentist. After all, Dentist B had a success rate of 80%, which matches the most frequently occurring success rate for other dentists. Thus for B, (observed minus expected) = 0. This is not proof that the null hypothesis is true; we can surmise only that there is insufficient evidence to conclude that it is false. Nevertheless, if the data are

such that the null hypothesis cannot be rejected, then it is often said (following Occam) that one accepts it.

Dentist C is in an awkward position: C's record of 13 successes is below average but not significantly so. We see from Table 10-4 and Fig 10-6b that even if Dentist C's true rate were 80%, observing 7 or more failures occurs more than 2.5% of the time. That is, if C regarded 7 failures as rare, C would also have to classify 8, 9, 10 . . . 20 failures as rare.

If we had chosen a larger sample—say, 300 patients—we might decide that more extreme values are needed. There would be 300 possible results rather than just the 20 possible in our example. Thus, the cases would be spread thinly across the possible results, and the percent frequency for many of these possible results would be low—indeed, less than 5%, even for results not far from the most likely value of 240 successes. Thus, in determining whether a particular result differs from the expected value, the question of interest is not the frequency of the particular result (which is determined in part by the number of possible results), but rather the location of the result in the distribution. The result can be located by a method similar to that used to find a particular home by its address; rather than searching for the number, however, one uses the cumulative frequency of all results as extreme as or more extreme than the observed result. In the case of Dentist A (20 successes), we are at the end of the street—there is nothing more extreme, and therefore the cumulative frequency equals the number for that particular result—that is, 1.1%. To locate the position of Dentist C's result, we would have to add the frequencies for 7 failures (5.5%) plus 8 failures (2.2%) plus 9 failures (0.7%) plus 10 failures (0.2%) plus 11 to 20 failures (approximately 0%) to get a total of 8.6%. Thus, the probability of getting at least the number of



failures obtained by Dentist C is greater than 5%, and Dentist C can accept the null hypothesis.

Dentist D, however, cannot accept the null hypothesis. D can determine that 11 successes in a sample of 20 from a population with an 80% success rate would occur only 0.9% of the time, and the total frequency for 9 or more failures is less than 2.5%. For D, (observed minus expected) ≠ 0; chance alone cannot explain the low success rate. D must reject H0 and sadly conclude that his failure rate is higher than expected.
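Each tail probability in Table 10-4 is a cumulative binomial sum. As a check, the sketch below (illustrative Python, using the exact binomial rather than the simulated reference distribution) reproduces Dentist C's tail:

from math import comb

def binom_pmf(s, n, p):
    # Probability of s events in n trials with per-trial probability p.
    return comb(n, s) * p**s * (1 - p)**(n - s)

# Under H0, failures follow Binomial(n = 20, p = .20); Dentist C had 7 failures.
tail = sum(binom_pmf(k, 20, 0.20) for k in range(7, 21))
print(f"P(7 or more failures) = {tail:.3f}")  # 0.087, close to the 8.6% in Table 10-4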

Some issues in statistical inference

One-tailed versus two-tailed tests

In our example, the dentists—prior to treating the patients—agreed to determine whether their results differed from the accepted value of 80% success for endodontic treatment, meaning they would be willing to accept results that were better or worse than average. An extremely modest dentist might decide—once again, prior to treating the patients—that it would be impossible (for example, because of the dentist's inexperience) to achieve a better than average success rate. Such a modest dentist would then accept results as being significantly different only if there were more failures than expected. Moreover (assuming the 5% significance level was adopted), the unlikely 5% of the samples would all be located on the left side of the distribution curve—that is, only one tail (as opposed to both tails) would be examined. Dentist C, for example, would still have to accept the null hypothesis; the cutoff line for the lower frequency of success, however, is moved to the right. In our example, 12 successes would be significantly fewer (at the 5% level) than the number of successes expected from a trial of 20 with a probability of success of .80.

Two types of error

Suppose that you are omniscient and you know that Dentist B is actually worse than average and that B's patients have only a 70% chance of success. Thus, you know that B committed an error upon accepting the null hypothesis. Statisticians say that B has committed a type II error and label the chance of erroneously accepting the null hypothesis β. The power of the test is given by 1 – β; the power quantifies the chance of detecting a real difference of a given size. The calculation of the probability of making a type II error is a complex function of the level of significance (ie, α), the sample size, and the actual magnitude of the difference between the populations (a value often unknown). Moreover, different formulae apply depending on how the data are distributed. Sample size issues, as well as formulae for various applications, are discussed in van Belle.5

In the preceding examples, we have considered the dentists' performance relative to a fixed success rate of 80% and a predetermined number of patients (20). However, suppose that two dentists—for example, Dentists B and C—want to compare their success rates relative to one another. We would then want to compare two proportions, each of which can be considered a sample of the outcome for each dentist:

Null hypothesis H0: pB – pC = 0

A common formula for making this comparison follows4:

Z = \frac{p_B - p_C}{\sqrt{\frac{pq}{n_B} + \frac{pq}{n_C}}}

where
Z = the z score for the normal distribution
pB, pC = the proportions of successes for Dentists B and C (ie, pB = XB/nB, pC = XC/nC)
q = 1 – p
p = (XB + XC)/(nB + nC)

and XB, XC represent the numbers of successes for Dentists B and C, respectively, when they treated nB and nC patients. This test uses the normal distribution to approximate the binomial distribution and is considered a good approximate test provided that n is large and neither np nor nq is very small. The normal distribution and the reason it is used to approximate other distributions will be discussed in the next chapter.

Suppose one wishes to know how many patients would be required to determine whether patients treated by Dentist B experienced a different success rate from those treated by Dentist C. A simplified formula suggested by van Belle—if one assumes equal sample sizes and a two-tailed test—follows:

n = \frac{16\, p_{avg} (1 - p_{avg})}{(p_B - p_C)^2}



Table 10-5 | Types of errors

Decision based on data | Actual situation: H0 true | Actual situation: H0 false
Accept H0 | Correct (true negative): probability 1 – α | Type II error (false negative): probability β
Reject H0 | Type I error (false positive): probability α | Correct (true positive): probability 1 – β

where pB = 0.80 (Dentist B's success rate was 16/20) and pC = 0.65 (C's success rate was 13/20), and the average proportion is given by:

p_avg = (pB + pC)/2 = (0.80 + 0.65)/2 = 0.725

Substituting the values in the equation:

n = 16 × 0.725 × (1 – 0.725)/(0.15)² = 141.8

Thus, each dentist would have to treat about 142 patients to ensure a comparison of reasonable power. This sample size, of course, is much larger than the sample size (20) used to examine the performance of one dentist relative to a fixed success rate, because two samples, and hence two sources of variability, are present. Note that type II errors are highly probable. For example, it would be difficult to distinguish between populations with 80% and 79% success rates; however, that difference is probably not large enough to have any practical significance.

The other type of error is called type I error, and its probability is designated α. This might occur in our example of testing observed versus expected success rates if the dentist's patients actually had an 80% chance of success but, through bad luck, the particular sample that was drawn happened to have 10 failures. The dentist would then reject the null hypothesis even though it was true. This type of error is at least under the control of the investigator: Setting a more stringent significance level of, say, 1% would lower the probability of a type I error. However, for a given sample size, lowering the value of α increases the chance that an actual difference between groups will fail to be detected. The only way to reduce both α and β simultaneously is to increase the sample size. But a larger sample size may be difficult to obtain because of factors such as cost and subject availability. Table 10-5 summarizes the possible outcomes of making a decision based on data and the probability of each outcome in terms of α and β.
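The rule-of-thumb calculation is short enough to script. A minimal sketch in Python, using the values from the example above (variable names are ours):

# van Belle's rule-of-thumb sample size for comparing two proportions
# (equal group sizes, two-tailed test).
p_b, p_c = 0.80, 0.65
p_avg = (p_b + p_c) / 2
n = 16 * p_avg * (1 - p_avg) / (p_b - p_c) ** 2
print(f"n per group = {n:.1f}")  # 141.8, ie, about 142 patients per dentist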

Lessons to be learned from these examples

Statistical inference has the following features:

1. It is based on probability (ie, the observed results are compared with samples selected randomly from a specified probability distribution). There is no possibility of attaining absolute certainty in any statistical test of significance.
2. An arbitrary decision must be made on the level of significance. In some instances investigators select levels of significance other than 5%, on account of considerations such as the consequences of committing a type I error in a given situation.
3. The sample size influences the possibility of observing a significant difference.
4. Only two possible hypotheses—which are mutually exclusive and exhaustive—are assumed: the null hypothesis and the alternative hypothesis. In other words, if you reject the null hypothesis, you must accept the alternative hypothesis.

Goodness of Fit

Example of goodness of fit: Mendel's experiment

One class of experiments is specifically designed to determine how well the results fit a theory. Comparison is made not between the results of two or more treatments but between the observed result of a single kind of event and a result predicted by theory. A classic example of this type of trial is used in genetics, whereby the outcomes of breeding experiments are tested to see if the results can be explained by Mendel's laws. In such experiments, which utilize nominal scale data, the results are normally expressed as counts, so the results obtained are whole numbers. This property of discreteness leads to the distinctive methods used in their statistical analysis. The simplest cases feature only two categories: In Mendel's pea-breeding experiments, color was used as a category. If homozygous yellow seed–producing



and green seed–producing pea plants are bred and the yellow color is dominant, all of the progeny in the first generation will be yellow, as shown below:

YY (homozygous yellow seed parent) crossed with yy (homozygous green seed parent)
All progeny are Yy

In the first generation, all progeny produce yellow seeds because they all possess the dominant Y. If the first generation is crossed with itself, however,

Yy crossed with Yy
YY + Yy + Yy + yy
yellow, yellow, yellow, green (3:1)

some green seed producers will be segregated, producing a ratio of 3 yellow:1 green. From a statistician's point of view, the mechanism is irrelevant. The problem is how to test given sets of data to see how well they fit a hypothesis. Among others (including Fisher6), Zar7 has applied this test to Mendel's data. In 10 experiments (which can be pooled because the samples are homogeneous) involving 478 plants, one would expect 358.5 to be yellow (¾ × 478) and 119.5 to be green (¼ × 478), according to Mendel's law. Mendel actually found 355 yellow peas and 123 green peas. Because of the relatively primitive statistical concepts in use at the time, Mendel accepted the findings as close enough to 3:1 to prove the point. Modern statisticians, however, use a chi-squared test where:

χ² = sum over all categories of (observed minus expected)²/expected, ie:

\chi^2 = \sum_i \frac{(f_i - F_i)^2}{F_i}

where fi = observed frequency in category i, and Fi = expected frequency in category i. Note that if the observed frequency equals the expected frequency, χ² = 0, whereas a large discrepancy between expected and observed values produces a high value for χ². Two rules must be observed with the χ² test: (1) the actual frequencies must be employed in the calculations (ie, it is not valid to convert the data to percentages and then to compare observed percentages with expected percentages); and (2) the expected value

in any comparison should not be less than 1, and less than 20% of the expected values should be less than 5.6 Given the formula for χ², it is obvious that the value would be inflated significantly by having low numbers in the denominator. From Mendel's data:

χ² = (355 – 358.5)²/358.5 + (123 – 119.5)²/119.5 = 0.137

But if χ² > 0, the question we must ask is, how often could we expect to get results that deviate from the expected values by at least that much, based on random samples of a population that is ¾ yellow and ¼ green? In theory (for this case), we could generate such a distribution by making a tetrahedral die with three sides marked yellow and one side marked green, then tossing it 478 times and documenting the number of times green faced down and the number of times yellow faced down. We would repeat this die-tossing experiment many times to see how often the deviations of the observed ratio exceeded those of Mendel. To save us the trouble of making tetrahedral dice (not to mention all the tossing), probability theorists have devised the χ² distribution to analyze such data. The shape of the χ² curve depends on the number of degrees of freedom (df), which in turn is related to the number of categories compared. The mathematics used to derive the χ² distribution (and thus the table of critical values [appendix 4]) is complex and need not concern us here. Use of the table, however, is quite simple. If a deviation as large as the one Mendel obtained occurred reasonably often, we would not be in a position to reject the hypothesis that the ratio is 3:1. On the other hand, if results with such a large deviation seldom occurred—say, less than 5% of the time—then we would conclude that it is unlikely the sample came from a population where the color ratio was 3:1. Of course, the results of Mendel's experiment fit well with his 3:1 hypothesis. Since there are only two groups (yellow or green), the df is calculated as follows:

df = (no. of groups – 1) = 2 – 1 = 1

When we consult the table of critical values (see appendix 4), we find that with 1 df, the critical value at the .05 level of significance is χ²₁ = 3.841, which of course is much greater than the 0.137 value calculated from the observations. Therefore, the observed numbers do not differ significantly from those expected from a population with a 3:1 ratio. Note that the same



reasoning applied earlier in another statistical test is applied here: There is an arbitrary level of significance (here α = 5%), and the null hypothesis is tested for rejection. Moreover, the test is directional in the sense that it does not prove that the sample came from a population with a color ratio of 3:1 but only that the data are compatible with such a proposal.

Many, including Fisher,4 have speculated that Mendel's results are "too good to be true"—that finding data in such close agreement with a hypothesis would be extremely rare—raising the possibility that Mendel cooked his data to agree with his hypothesis. More recent research6 has tended to look at Mendel's results less critically. For example, it is known that Mendel also published data that did not agree with his hypothesis, and it is difficult to understand why he would cook his data in one case and not the other. In the time since Mendel performed his experiments, there have been many developments in statistics, including the "stopping rules" modern scientists adopt to determine when an experiment should be curtailed. One explanation that has been proposed is that Mendel simply collected data until they supported his hypothesis and then stopped. In any case, Fisher was reluctant to assign personal blame to Mendel and noted the possibility "that Mendel was deceived by some assistant who knew too well what was expected."
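Modern software makes the goodness-of-fit computation trivial. Assuming SciPy is available, a sketch for Mendel's pooled data:

from scipy.stats import chisquare

# 355 yellow and 123 green observed, against 3:1 expectations of 358.5 and 119.5.
result = chisquare(f_obs=[355, 123], f_exp=[358.5, 119.5])
print(f"chi2 = {result.statistic:.3f}, P = {result.pvalue:.2f}")
# chi2 = 0.137, P = 0.71: far below the critical value of 3.841 at alpha = .05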

Why use nonparametric statistics?

In statistics, the term parameter refers to a quantity used in defining the distribution of a variable (such as age) in a population. The binomial distribution, for example, uses parameters defined by the number of trials and the probability of success in each one. Unlike tests that are based on a normal distribution of values, the χ² test used in the Mendel example required no parameters, such as a mean or a variance, to be calculated from the data. The χ² test in that example is a nonparametric statistic; it is distribution-free, which means that the form of the distribution does not have to be specified and only the expected values are required. There remains the assumption that the difference between the observed values and the expected values arises from random sampling. Another requirement is that the expected values, as noted earlier, should not be less than 5 in more than 20% of the cells. It should also be noted that measured values of variables can follow a χ² distribution, in which case parameters such as df and mean are involved.

The logic of nonparametric inference is based on the principles of finite probability and makes use of the concept that an event's probability can be thought of as the frequency of that event relative to other events over the long term. In addition to χ², a number of other nonparametric tests have been developed to analyze other types of comparisons. The major limitation of nonparametric tests is that they are generally less sensitive than a classical test used on parametric data (eg, the t test).8 Nonparametric tests are less likely to detect a significant difference between groups, if such a difference is present, and hence are more likely to accept the null hypothesis even when it is false. For some data, one could choose to use either a parametric test or a nonparametric test. Willemsen9 offers three reasons investigators use nonparametric tests:

1. The assumptions prevalent in common parametric tests about distribution and parameters are not met by the measurements used in a study.
2. A general attitude dictates that research conclusions should be based on as small a number of untested assumptions as can possibly be arranged.
3. Investigators determine that a particular nonparametric test is the one most likely to reject the null hypothesis being tested if it is indeed false.

Example of goodness of fit to a probability distribution: The Poisson distribution

The binomial distribution deals with situations in which it can be determined whether an event (eg, success or cure) occurred or did not occur (eg, failure) and in which the total number of possible outcomes (eg, patients treated) is known. Often for natural phenomena, however, the occurrence of events can be measured (such as flashes of lightning), whereas the frequency of occurrence of nonevents is not evident. Because a flash of lightning can occur in an infinitesimally small amount of time, it would be difficult to calculate in how many of these units of time a flash of lightning did not occur. Often, the probability of a given number of occurrences can be calculated using the Poisson distribution, which describes isolated events or occurrences in a continuum, such as space or time. The only condition is that the expected number must be constant from trial to trial. Sometimes called the law of small numbers, the Poisson distribution is often used in describing rare events and can be considered an approximation of the binomial distribution where P (the probability of the event occurring) is very small and n is relatively large.

A widely used example of this distribution is the data collected on the number of men in the Prussian army who were killed by horse kick over 20 years in 10 army corps (Fig 10-7). A full analysis of the data by Moroney10 reveals that the condition of constant expectation was met. Had the 10 corps varied greatly in number of men or horses, for example, or had the Prussian cavalry instituted a "Safety with Horses" campaign during the period of the study, then that condition would not have been met. As we can see, the distribution is highly skewed. The equation for the Poisson distribution is:

P(r) = e^{-m} \frac{m^r}{r!}

where
P(r) = probability of the event (in this example, death by horse kick) occurring r times
r! = r(r – 1)(r – 2) . . . 1 (eg, 5! = 5 × 4 × 3 × 2 × 1 = 120)
m = mean number of events per unit of time (in this example, per year)
e = base of natural (Napierian) logarithms = 1/0! + 1/1! + 1/2! + . . . = 2.7183

Fig 10-7 Relative frequency histogram of deaths from horse kick (x-axis: deaths from horse kick; y-axis: no. of years this no. of deaths occurred).

The Poisson distribution is defined by just one parameter—the mean or variance (since, for the Poisson distribution, the variance equals the mean). Because it is often used to look at rates or numbers of events over time, the Poisson distribution sometimes expresses the mean as m = r/t, where r is the rate per unit time and t is the length of time under consideration. The data from the Prussian cavalry deaths by horse kick yield an estimate of 122/200 deaths per year per corps (ie, m = 0.61). Using the formula, we can calculate the probabilities for 0, 1, 2, 3, and 4 deaths per year as expected by the Poisson distribution (Table 10-6). Inspection of these numbers reveals that the Poisson distribution demonstrates a very good fit with the actual observed data. One could do a χ² analysis as follows:

χ² = (109 – 109)²/109 + (65 – 66.3)²/66.3 + (25 – 24.9)²/24.9 = 0 + 0.02549 + 0.0004 = 0.02589

(The last three cells have been combined so that the expected value for any cell is not < 1 and so that less than 20% of the expected values are < 5.11)

For df = 2 and α = .05, the critical value is χ²crit = 5.99.
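The Poisson fit in Table 10-6 can be reproduced with a few lines of Python (a sketch; the variable names are ours):

from math import exp, factorial

m, corps_years = 0.61, 200  # 122 deaths over 10 corps x 20 years
for r in range(5):
    p = exp(-m) * m**r / factorial(r)
    print(f"P({r} deaths) = {p:.3f}; expected count: {corps_years * p:.1f}")
# Prints .543, .331, .101, .021, .003 and 108.7, 66.3, 20.2, 4.1, 0.6,
# matching Table 10-6 (109 after rounding).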

Thus, because the χ² value for the fit to the null hypothesis (0.02589) is much less than the critical value (5.99), we must accept the null hypothesis and conclude that the observed fluctuations from those predicted by the Poisson distribution could be explained by chance. Note that the Poisson distribution does not produce the normal bell-shaped distribution—also called the Gaussian distribution—that is used most widely in statistics (see chapter 11).

Knowing the distribution of values that occurs in a sample is important because it allows estimates to be made about the parent population. Kac12 recounts the story of the mathematician Steinhaus, who was forced to live under an assumed name in Poland during the Second World War and to hide in the forest most days and nights. The only information he had of the war was from a German-controlled newspaper, which functioned for propaganda purposes and emphasized the great victories of the German armies. Steinhaus wondered about the true losses of the German army. He noted that the authorities allowed a fixed number of obituaries of the form "Hans, the son of Klaus and Hildegard Schmidt, fell for the Fatherland." However, some of the obituaries read "Dieter, the second son of Klaus and Hildegard Schmidt, fell for the Fatherland." Knowing that the number of male offspring in a family follows a Poisson distribution and knowing the average number of male offspring, Steinhaus could determine the proportion of soldiers killed. From this he could estimate the true losses of the German army,



Table 10-6 | Distribution of no. of deaths per year by horse kick for the Prussian cavalry

No. of deaths per year per corps | 0 | 1 | 2 | 3 | 4
Frequency taken from actual data | 109 | 65 | 22 | 3 | 1
Probability | .543 | .331 | .101 | .021 | .003
Frequency expected by Poisson (m = 0.61) | 109 | 66.3 | 20.2 | 4.1 | 0.6

Table 10-7 | Observed frequencies of caries in Saudi naval men

Carrier type | Caries present | Caries absent | Row total
e-type carriers [+] | 52 | 4 | 56 = R1
Non-e-type carriers [–] | 110 | 30 | 140 = R2
Column total | 162 = C1 | 34 = C2 | 196 = total N

Data from Keene et al.13

and these vast numbers gave him hope that Germany would eventually lose the war. The estimate proved to be quite accurate, which is a tribute not only to Steinhaus but also to the power of understanding probability distributions.

The Poisson distribution has many applications: It describes situations such as the distribution of Simplified Oral Hygiene Index (OHI-S) scores; the number of blood cells found in one square of a hemocytometer; radioactive decay; cohort studies of diseases with rare events; and even the number of goals in a soccer match.

Contingency Tables: An Especially Useful Tool in Evaluating Clinical Reports

Contingency tables, also known as bivariate frequency distribution tables, allow investigators to identify an association between two variables. These tables can be particularly useful in evaluating clinical dental research articles because the results are often presented in a form that can be considered as nominal scale data. Patients will be either male or female, their problem will fall into some category, and the treatment will be a success or failure. The relationship among these variables can be elucidated through contingency-table analysis. For example, consider the data of Keene et al13 on Saudi naval men (Table 10-7). As presented, this is a contingency table. It has four cells and shows the totals for the rows and columns. The frequency frc refers to the value found in row r and column c. Thus, in this example, f11 = 52. The df for this test is:

df = (no. of rows – 1) × (no. of columns – 1) = (2 – 1) × (2 – 1) = 1

The total frequency for row i (Ri) is the sum of all frequencies in the row, where i = the row number and j = the column number. In our example,

R_1 = \sum_{c=1}^{2} f_{1c} = 52 + 4 = 56

Similarly, for column 1,

C_1 = \sum_{r=1}^{2} f_{r1} = 52 + 110 = 162

Next, we can calculate the expected frequencies, if there were no association between the variables, according to the formula:

F_{rc} = \frac{R_r \times C_c}{n}

where n = the total sample size. In our example,

F_{11} = \frac{56 \times 162}{196} = 46.3

The values of the expected frequencies are calculated in Table 10-8. Now the χ² value can be calculated for each cell (Table 10-9).



Table 10-8 | Expected frequencies of caries in Saudi naval men

Carrier type | Caries + | Caries –
e-type carriers [+] | 46.3 | 9.7
Non-e-type carriers [–] | 115.7 | 24.3

Data from Keene et al.13

Table 10-9 | χ² values for Table 10-7

Cell | Observed – expected | After continuity correction | (Observed – expected)² | (Observed – expected)²/Expected
Row 1, column 1 | 52 – 46.3 = 5.7 | 5.2 | 27 | 0.58
Row 1, column 2 | 4 – 9.7 = –5.7 | –5.2 | 27 | 2.78
Row 2, column 1 | 110 – 115.7 = –5.7 | –5.2 | 27 | 0.23
Row 2, column 2 | 30 – 24.3 = 5.7 | 5.2 | 27 | 1.11
Total χ² | | | | 4.70

The calculated χ² is 4.70 with df = 1. The critical value of χ² (α = .05, df = 1) is 3.84. Because 4.70 > 3.84, we must reject the null hypothesis and conclude that for these Saudi naval men, the presence of caries is significantly associated with the possession of e-type Streptococcus mutans. Despite this statistically significant association, however, an analysis to be presented later shows the strength of the association to be quite weak. Thus, this example demonstrates that statistical significance and size of effect are separate facets of data interpretation.

An assumption of some commonly used statistical tests is that the samples are drawn from a parent population with a Gaussian (normal) distribution. However, measurement of some characteristics might show frequency distributions that are decidedly non-normal. Often when we measure any sample of objects, we are trying to infer something about the parent population. Suppose we had only a small number of sample values, and consequently we did not get a frequency distribution curve. We could first calculate the mean, m, from our (small) sample. If we assumed that the parent population was Poisson distributed, we would make our inferences on the basis of a skewed curve, whereas if we assumed the parent population was normally distributed, we would use a curve symmetrically centered around m. Thus, our assumptions would have influenced our conclusions—clearly undesirable. As noted earlier, some scientists prefer to keep the number of assumptions to a minimum and to use distribution-free tests. Others appear to assume that the normal distribution applies to every measurement. This is a dubious assumption: In many instances in which large bodies of data on observational variation have been tested against the normal distribution, they have shown disagreement. The essential point here is that the assumption of a particular distribution can influence the conclusion.
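For comparison, the whole contingency analysis of Table 10-7 can be delegated to software. A sketch assuming SciPy, which applies the Yates continuity correction to 2 × 2 tables by default (the small difference from the hand-rounded 4.70 of Table 10-9 reflects rounding, not a different method):

from scipy.stats import chi2_contingency

table = [[52, 4], [110, 30]]  # Table 10-7: caries present/absent by carrier type
chi2, p, df, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {df}, P = {p:.3f}")  # chi2 = 4.74, df = 1, P = 0.029
print(expected)  # [[46.3, 9.7], [115.7, 24.3]], as in Table 10-8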

Resources for Statistics

A number of reference texts present statistical concepts and calculations in greater detail than occurs in this book. The ones listed here are books I have found useful. My experience is that the explanations of various concepts vary in detail and approach among the books. A common technique of teachers is "to put it another way" if the student doesn't grasp the first explanation, so if you find a concept confusing, it is sometimes useful to consider how a different author explains it.

Box GEP, Hunter JS, Hunter WG. Statistics for Experimenters: Design, Innovation, and Discovery, ed 2. Hoboken: John Wiley & Sons, 2005. A very useful book, as attested to by the fact that my copy of the 1st edition was borrowed and never returned by a colleague.

Fleiss JL. Design and Analysis of Clinical Experiments. New York: John Wiley & Sons, 1986.



Includes a section on reliability and decayed, missing, or filled surfaces (DMFS) scores.

Huck SW. Reading Statistics and Research, ed 6. Boston: Pearson, 2012. Also has a companion website. This is the first book that my periodontist colleague and former student, Hugh Kim, reaches for when he encounters a statistical issue or an unknown test.

Kruschke JK. Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Amsterdam: Academic Press, 2011. A good introduction to this burgeoning field, but not for the faint-hearted. The "BUGS" does not refer to problems with the tutorial but to widely useful software for modeling.

Lang TA, Secic M. How to Report Statistics in Medicine: Annotated Guidelines for Authors, Editors, and Reviewers, ed 2. Philadelphia: American College of Physicians, 2006. Part writing guide, part statistics instruction; overall an excellent general resource book.

Lesaffre E, Feine J, Leroux B, Declerck D. Statistical and Methodological Aspects of Oral Health Research. Oxford: John Wiley & Sons, 2009. As indicated by the title, focuses on oral health research.

Norman GR, Streiner DL. Biostatistics: The Bare Essentials, ed 2. Hamilton, Ontario: Decker, 2000. Informal in tone; for example, they cite John Cleese: "Too many people confuse being serious with being solemn."

Staff of Research and Education Association. The Statistics Problem Solver. New York: Research and Education Association, 1982. Many worked-out problems.

Zar JH. Biostatistical Analysis, ed 5. Englewood Cliffs, NJ: Prentice Hall, 2010. When filed alphabetically by author, this book is last, but when I am faced with the unfamiliar or dubiously remembered, it is the book I look at first.

References

1. Gould SJ. The New York Times Review of Books. 18 August 1988:35.
2. Gleason RW. The Essential Pascal. Toronto: Mentor-Omega, 1966:89–97.
3. Larkin PA. Notes for Biology 300 (Biometrics): A Handbook of Elementary Statistical Tests. Vancouver, BC: University of British Columbia, 1978:15–22.
4. Fisher RA. The Design of Experiments, ed 8. New York: Hafner, 1966:13.
5. Van Belle G. Sample size. In: Van Belle G. Statistical Rules of Thumb. New York: Wiley, 2002:29–51.
6. Piegorsch WW. Fisher's contribution to genetics and heredity, with special emphasis on the Gregor Mendel controversy. Biometrics 1990;46(4):915–924.
7. Zar JH. Biostatistical Analysis, ed 2. Englewood Cliffs, NJ: Prentice Hall, 1984:70.
8. Porkess R. Collins Web-Linked Dictionary of Statistics, ed 2. Glasgow: HarperCollins, 2005.
9. Willemsen EW. Understanding Statistical Reasoning. San Francisco: Freeman, 1974:170.
10. Moroney MJ. Facts from Figures. Harmondsworth, UK: Penguin, 1965:96–107.
11. Grace-Martin K. Observed values less than 5 in a chi square test—No biggie. The Analysis Factor. https://www.theanalysisfactor.com/observed-values-less-than-5-in-a-chi-square-test-no-biggie/. Accessed 23 July 2019.
12. Kac M. Marginalia: Statistical odds and ends. Am Sci 1983;71:186.
13. Keene HJ, Shklair IL, Anderson DM, Mickel GJ. Relationship of Streptococcus mutans biotypes to dental caries prevalence in Saudi Arabian naval men. J Dent Res 1977;56:356.


11 Elements of Probability and Statistics, Part 2: Continuous Variables

"Probability theory is nothing but common sense reduced to calculation."
Pierre Simon Laplace1

The previous chapter largely concerned discrete variables (mainly counts), but most measurements deal with continuous variables, which, at least in theory, can take on an infinite range of values. An example of a discrete variable is the value found on the throw of a die. The mean value for a single throw of a die is (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5; however, we never actually observe the value 3.5, because only the discrete integers 1, 2, 3, 4, 5, and 6 are possible. A continuous variable, conversely, is a variable for which there is virtually an infinite number of possible values (given a scale of sufficient magnitude). Recent work on the mass of the electron, for example, has described it to 17 decimal places. When describing the frequency distribution of measurements of continuous variables, it is not possible to put every measurement into its own frequency cell or bin, as we could in the frequency histogram in the example for the binomial distribution (see Fig 10-5). Indeed, most of the infinite number of bins representing the infinite possible values of a continuous variable would not contain any measurement if the measurements were taken from a finite population. To calculate the distribution of values for a given range of a continuous variable, statisticians apply the concept of probability density. A simple application of this concept is demonstrated in the uniform distribution (Fig 11-1), which has a constant probability density equal to 1 throughout the range of the variable from x = 0 to x = 1. Because the ordinate is 1 throughout this range and because the highest possible x value is 1, the total area of the uniform distribution is 1 (ie, 1 [the width] × 1 [the height] = 1). The uniform distribution of a variable is useful for comparing the relative probability of mutually exclusive events. For example, to calculate the probability that a value selected at random is ≤ 0.4, we would measure the area under the uniform distribution curve from x = 0 to x = 0.4 and find that this




Fig 11-1 The uniform distribution in which the probability density is constant throughout the range of possible values from x = 0 to x = 1, and “0” elsewhere. Total area = 1.

Fig 11-2 Probability that a randomly selected value of x ≤ 0.4.

area is equal to 0.4. In other words, the probability (P) = .4. The event that a value selected at random is greater than 0.4 is mutually exclusive of the event that x ≤ 0.4, so its probability can be calculated as 1 – 0.4, or P = .6. Graphically, this can be illustrated as shown in Fig 11-2.

The uniform distribution is a simple relationship, and calculation of areas, as shown above, is straightforward. For more complex distributions, integral calculus is used to calculate areas under probability density curves. Just as differential calculus concerns rates of change in infinitesimally small steps, integral calculus calculates the cumulative effect of many small changes in a quantity—such as occurs when the values of a continuous variable are spread over a range specified by a probability density function. In other words, integral calculus allows us to calculate the area under the curve of a probability density function for a given range of a continuous variable, and that area corresponds to the probability that values in that range will occur (given the condition that the total area under the probability density curve is 1).
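In practice, statistical libraries expose these areas as cumulative distribution functions, so no integration needs to be done by hand. A sketch, assuming SciPy is available:

from scipy.stats import uniform

# Area under the uniform density on [0, 1] from 0 to 0.4.
print(uniform.cdf(0.4))      # 0.4, ie, P(x <= 0.4)
print(1 - uniform.cdf(0.4))  # 0.6, the probability of the mutually exclusive complement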

The Normal Distribution

In the preceding chapter, we looked at sampling from the binomial distribution, as well as possible outcomes from the sum of two dice; both of these examples used discrete variables. We now look at what we expect when we sample a variable with a continuous distribution.

Empirically, many types of scores have been found to follow the so-called normal or Gaussian (after the mathematician Carl Gauss, 1777–1855) distribution. The distribution is called normal because some early statisticians mistakenly believed that most probability distributions followed the "normal" curve, whose mathematical equation is given by:

p(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}

where
p(x) = probability density function
σ² = variance
x = the variate
μ = mean value of x for the population (ie, the average)

Note that, by convention, one uses the Greek symbols μ for the mean of the population and σ² for the variance of the population. When describing samples from the population, one uses X̄ for the mean and s² for the variance. For a number of applications, it is useful to consider the so-called z score, defined as:

z = \frac{x - \mu}{\sigma}

If the mean μ = 0 and the standard deviation (SD) σ = 1, the equation above reduces to:

p(x) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}z^2}



Confidence Intervals Probability density

Sample from a population of known dispersion .3413 .1360 .0214 .0013

-3𝛔

-2𝛔

-1𝛔 Mean 3𝛔

2𝛔

1𝛔

Fig 11-3 Standard normal distribution curve. (Areas not to scale.)

This is called the standard normal distribution. The constant 1/√(2π) is a scale factor that makes the total area under the normal curve equal to 1. The standard normal distribution curve (Fig 11-3) has many uses, but it is perhaps most commonly used to calculate the percentage of values beneath some portion of the curve or the likelihood of any particular value. For example, if you scored 90 on a test in which the mean grade was 70 and the SD was 10, you could infer—if the grades were normally distributed—that you performed better than about 97.5% of the class, because your grade was 2 SDs above the mean. Similarly, the teacher would expect the percentage of students whose grades fell between 70 and 80 to be about 34%. Interestingly, the normal curve never touches zero—it extends out to infinity in both directions—so some particular high or low value may be rare but is never theoretically impossible. As an illustration of this principle, the tallest man in recorded medical history was Robert Wadlow, who was measured at 8 feet, 11 inches (2.72 m), almost 6 SDs above the mean height of American males.
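The test-score example can be checked with the standard normal cumulative distribution function (a sketch, assuming SciPy; the exact figure is 97.7%, slightly above the rounded 97.5% quoted above):

from scipy.stats import norm

z = (90 - 70) / 10  # score 90, mean 70, SD 10
print(f"z = {z:.1f}; proportion scoring lower = {norm.cdf(z):.4f}")  # 0.9772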

Confidence Intervals

A confidence interval is a range of values that is likely to contain an unknown population parameter. If you draw a random sample many times, a certain percentage of the confidence intervals will contain the population mean. Similarly, sometimes scientists will be interested in the difference in means—for example, between a treated and a control group. To be exact, the confidence interval in that instance means that if a series of identical studies were carried out repeatedly on different samples from the same population, and a 95% confidence interval for the difference between the sample means were calculated in each study, then in the long run 95% of the confidence intervals would include the population difference between means.

Sample from a population of known dispersion

Suppose we want to estimate the amount of foul-smelling CH3SH in air expelled from the human mouth under a certain condition. Moreover, suppose that the only material we can get is provided in one sample that is representative of the population of measurements that may be taken under that condition. We want to use the information from that sample to estimate the true value of the CH3SH level for the population. We take the sample to be run on a gas chromatograph under the direction of the father of malodor analysis, Dr J. Tonzetich. After the analysis, Tonzetich reports that there are 147 ppm of CH3SH in that sample. However, we do not know the accuracy of that determined value and want some estimate of how random fluctuations affect it. Because Tonzetich has been running the gas chromatograph for 30 years, we decide to ask him, and he replies that his experience, based on 10,000 measurements on a calibrated reference standard, is that the SD is 5 ppm. We are now in a position to calculate the range within which the true value for the population may be found. First, we must decide on a level of certainty, and, like many biologists, we will be satisfied with 95% certainty. If we know (for example, from plotting Tonzetich's population of 10,000 samples) or assume that the values are normally distributed, then we can infer that 95% of the values fall within 1.96σ (where σ is the population SD) of the true value. Therefore, if we take the determined value and go 1.96σ in either direction,


Fig 11-4 Effect of sample size on the expected distribution of means: (a) original distribution, (b) distribution of means of sample size of 2, (c) 4, (d) 8, (e) 12.

Hence, we calculate as follows:

Confidence interval (CI) = range in which the true value will be found 95% of the time
= determined value ± 1.96σ
= 147 ± (1.96 × 5) ppm
= 147 ± 9.8 ppm

Technically, a 95% CI (ie, mean ± 1.96σ) means that if we performed the sampling experiment repeatedly and constructed an interval each time, the intervals would include the true value 95% of the time.
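As a minimal sketch of the calculation just performed (assuming scipy is available; the variable names are ours, not the author's):

```python
from scipy.stats import norm

sigma = 5.0          # population SD (ppm), known from the calibration data
x = 147.0            # single determined value (ppm)
z = norm.ppf(0.975)  # 1.96 for a two-sided 95% interval

half_width = z * sigma
print(f"95% CI: {x:.0f} +/- {half_width:.1f} ppm")  # 147 +/- 9.8 ppm
```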

Sample from a normal distribution

Suppose that, instead of one sample, we had four samples. Clearly, this puts us in a better position, because the chance that all four samples were drawn from some extreme of the population distribution is smaller than the chance of that occurring with one sample. When calculating the mean of the four samples, we would expect a better estimate of the mean of the population than if we simply took a single sample. But exactly how much better? If we are sampling from a population with a normal distribution, the distribution of the means for any given sample size is also normal, with an SD given by σ/√n. This new statistic, which describes the sample-to-sample variation in the mean from samples of size n, is called the standard error (SE):

SE = σ/√n

We can illustrate graphically how the distribution of means changes with the sample size. Figure 11-4 shows the expected distribution of means when sample sizes of 2, 4, 8, and 12 are used to construct means. The confidence limits for these distributions can be calculated as follows:

CI = X̄ ± z σ/√n

where z (based on the normal distribution) = 1.96 for 95% CI and 2.58 for 99% CI, etc.


Assuming for the sake of illustration that the four samples have a mean of 147, the CI becomes:

CI = 147 ± 1.96 σ/√n

Thus, with the population SD of 5 ppm given earlier and n = 4:

CI = 147 ± (1.96 × 5/√4) ppm = 147 ± 4.9 ppm

Note that the CI has been halved by using a sample of 4.

Sample from a population whose dispersion is not known a priori

Once again, suppose we want to measure sulfur levels in air expelled from the mouth, but we have offended Tonzetich because we refused to drink his homemade wine, Chateau Tonzetich. Therefore, we consult another biochemist; however, lacking Tonzetich's experience, this biochemist does not have data to provide us with the SD of the measurement. But, because we (again) have four samples and hence four values, we can calculate an SD from our combined samples. The problem becomes how to use the SD calculated from the sample to estimate the population's SD. Fortunately, this mathematical problem was solved by Gosset, who published under the pseudonym Student so competitors would not know the sophisticated statistical control measures being used by his employer, the Guinness brewery. Gosset derived a distribution, called the Student distribution, or t distribution, that accounts for the additional uncertainty introduced when the sample SD s is used in place of the population SD σ. In brief, he derived these values by sampling from a normal population. This exercise in probability theory yielded a distribution curve whose shape depends on the size of the sample, or the number of degrees of freedom (df) of the sample. As the sample size increases, the t distribution approaches the normal curve. Based on Gosset's work, CIs can be calculated as follows:

CI = X̄ ± t s/√n

where
X̄ = mean of sample (in this example, X̄ = 147)
s = SD of sample
n = number of observations
t = a constant for a given number of df (n − 1) and level of certainty

The actual value of t can be located in statistical tables (see appendix 3).

Suppose the SD calculated from the four sample values is 10 ppm. Then:

CI = 147 ± t × 10/√4 ppm

From the table, t (df = 3, α = .05) = 3.182, so

CI = 147 ± 15.9 ppm

Estimation

Calculation of CIs is an example of estimation, the process whereby one calculates the value of an unknown quantity—here the level of sulfhydryl compounds—on the basis of a particular sample of data. There will always be some uncertainty associated with an estimate; that uncertainty is indicated by the CI—typically 95%. If, for example, one measured sulfhydryl levels in a subject's mouth air another four times (ie, four new samples), one would probably get a slightly different mean and a slightly different 95% CI. However, if the sampling process were repeated many times, we would find that the CIs calculated from our samples would cover the true population mean about 95% of the time.

From the perspective of dental clinicians, an important use of CIs arises during meta-analysis of the literature, in which the effects of a treatment found in different studies are compared. For example, Worthington3 compared the change in attachment level in four studies of guided tissue regeneration. Figure 11-5 demonstrates that only one of the studies (study C) included the value indicating no effect, and three of the studies had CIs that fell in the region favoring the control—ie, the untreated group. In meta-analysis, studies are often combined and weighted by such variables as the number of subjects in each study to produce a combined CI.

An advantage of using CIs, as opposed to simple null hypothesis testing, is that one sees immediately the relative variation and range of possible effect sizes, rather than basing one's decision on a single test. Moreover, as shown in the meta-analysis example, one can compare different studies directly or with regard to other methods of treatment. Another common use of CIs is to calculate the CI of the difference between two means, typically the difference between control and treated populations. If that CI includes zero, then the treatment has not been demonstrated to have a significant effect. One might wish to use CIs to predict the results of hypothesis testing in comparing means directly.


Fig 11-5 Example of a meta-analysis for the effect of a treatment for a continuous outcome variable. Three studies (A, B, D) found the treatment to be less effective than the control. Study C found no effect.

Intuitively, it would seem that if the 95% CIs for the estimated means of two populations (say, treated and control groups) overlapped, then the two means would not differ significantly at the 5% level. However, CIs for two statistics can overlap by as much as 29% and yet the statistics can be significantly different.2
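This possibly counterintuitive point can be illustrated with invented numbers: the individual 95% CIs below overlap, yet the test of the difference is significant, because the SE of a difference is √(SE1² + SE2²), not the sum of the two SEs. The means and SEs here are made up purely for illustration:

```python
import math

# Two group means with equal standard errors (invented values)
se_a = se_b = 1.0
mean_a, mean_b = 0.0, 3.0

# Individual 95% CIs
ci_a = (mean_a - 1.96 * se_a, mean_a + 1.96 * se_a)   # (-1.96, 1.96)
ci_b = (mean_b - 1.96 * se_b, mean_b + 1.96 * se_b)   # ( 1.04, 4.96)
print("CIs overlap:", ci_a[1] > ci_b[0])              # True

# The test of the difference uses the SE of the difference
se_diff = math.sqrt(se_a**2 + se_b**2)                # ~1.41, not 2.0
z = (mean_b - mean_a) / se_diff
print(f"z = {z:.2f}")                                 # ~2.12 > 1.96, so significant
```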

Relationships Between the Normal, Poisson, and Binomial Distributions

The binomial distribution discussed earlier (and illustrated in Fig 10-5) lacks the familiar symmetric bell-shaped appearance of a normal curve. However, if p (the proportion of successes) and q = (1 − p) both equal 0.5, the binomial distribution is symmetric. Moreover, as n, the number of trials, increases (to at least greater than 20), and for any given probability p such that 0.2 < p < 0.8, the binomial distribution can be approximated by the normal curve with:

mean μ = np, and SD σ = √(np(1 − p))

The value of n required to get a reasonable approximation is smallest when p = 0.5 and increases as p or q moves away from 0.5, because the underlying binomial distribution becomes less symmetric.

An advantage of using the normal distribution to approximate the binomial is computational; tables giving the probability represented by various areas on standard normal curves are common, and scientists are accustomed to working with them. An example is the use of z scores of the standard normal distribution to calculate probabilities of events. For example, one might wonder how rare it is to flip a coin 100 times and get 55 or fewer heads. Assuming a fair coin (that is, the expected number of heads is 50), one could calculate:

z = (x − μ)/σ

where x = number of heads, μ = mean, and σ = SD. Thus,

μ = np = 100 × 0.5 = 50
σ = √(npq) = √(100 × 0.5 × 0.5) = 5
P(X ≤ 55) = P(Z ≤ (55 − 50)/5) = P(Z ≤ 1)

Looking up a table of Z, one finds that the area of the standard normal curve between 0 and 1 is 0.341. Another 0.5 of the standard normal curve area lies below a z score of 0. So the total probability of getting 55 heads or fewer in 100 flips of a coin is .841; that is, this result would be expected 84.1% of the time.

One can also use the normal approximation to calculate CLs for the expected number of events using the formula SE = SD/√n. Thus, as with the earlier discussion of parametric data having means, the 95% CI for the expected number of events will be:

95% CI = np ± 1.96√(np(1 − p))

Finally, the binomial distribution can also be approximated by the Poisson distribution when either p or q is fairly small; the closer p or q is to zero, the better the approximation.
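The coin-flip calculation above can be checked against the exact binomial probability; a minimal sketch, assuming scipy is available:

```python
import math
from scipy.stats import binom, norm

n, p = 100, 0.5
mu = n * p                           # 50
sigma = math.sqrt(n * p * (1 - p))   # 5

# Normal approximation to P(X <= 55)
print(norm.cdf((55 - mu) / sigma))   # ~0.841, as in the text

# Exact binomial answer for comparison
print(binom.cdf(55, n, p))           # ~0.864; the small gap reflects the
                                     # absence of a continuity correction
```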


Advantages and additional applications of confidence intervals

Confidence intervals can also be calculated for proportions and their differences, regression and correlation, relative risks, odds ratios, standardized ratios and rates, and survival time analysis. There are also procedures for calculating confidence intervals for some nonparametric analyses; see Gardner et al4 for detailed information. A major benefit of using confidence intervals is that they indicate the precision of the sample estimates of population values on the scale of the original measurements, a more informative approach than simply using P values to dichotomize findings into significant and nonsignificant. Thus, many medical journals expect papers submitted to them to contain confidence intervals where appropriate.

Dissatisfaction with the significance test is widespread in science. In March 2019, over 800 scientists signed an article inveighing against statistical significance testing5 in a call for an end to hyped claims and the dismissal of possibly crucial effects. The special issue of Nature in which it appeared had articles such as "It's time to talk about ditching statistical significance."6 Ominously, this editorial had the subtitle: "Looking beyond a much used and abused measure would make science harder, but better." It remains to be seen whether this call will be influential in changing scientists' statistical approach, ingrained as it has been for almost a century. The significance test has been under attack for a long time; this debate may be one of those that will be resolved only with the passing of a generation of scientists who have grown attached to the technique.

Concluding Remarks

This and the preceding chapter provide an elementary introduction to the statistical concepts most widely used in science. More information, such as simple calculations for estimating sample sizes for parametric data and effect sizes, is given later. However, statistics is a complex field, and the information presented in this book would be sufficient for some types of statistically simple laboratory and clinical research but insufficient for others. For those readers planning an investigation, it is worthwhile to consult a statistician prior to embarking on a study, so that common statistical pitfalls can be avoided. It should also be noted that the field of statistics is continually developing, and, as in dentistry, specialists—for example, experts in cluster sampling—have emerged. Thus, just as in dentistry, where you may or may not need a specialist endodontist to do your root canal, in designing your investigation, depending on the complexity of the problem, you may or may not decide to consult a specialist statistician for advice, but it is frequently useful to do so.

A list of statistics resources is provided before the references in chapter 10.

References

1. Laplace PS. Théorie analytique des probabilités, 1812. Cited in CIM Bulletin no. 11, December 2001, Gellerie: Diogo Pacheco d'Amorim.
2. Van Belle G. Statistical Rules of Thumb. New York: Wiley, 2002:39.
3. Worthington H. Understanding the statistical pooling of data. In: Clarkson J, Harrison J, Ismail A, Needleman I, Worthington H (eds). Evidence-Based Dentistry for Effective Practice. London: Martin Dunitz, 2003:75–87.
4. Gardner MJ, Altman DG, Machin D, Bryant T. Statistics with Confidence: Confidence Intervals and Statistical Guidelines, ed 2. Hoboken: John Wiley & Sons, 2013.
5. Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature 2019;567:305–307.
6. It's time to talk about ditching statistical significance. Nature 2019;567:283.


12 The Importance of Measurement in Dentistry



At a meeting of the Evolution Committee of the Royal Society, Weldon had read a paper on the sizes of the carapaces of a certain population of crabs. Bateson, who considered the results of no biological importance, when asked by the Chairman of the Committee to comment, did so in a single devastating sentence: "Though all science might be measurement, all measurement was not necessarily science."
VERNON HERBERT BLACKMAN1

A Canadian Reader's Digest article published in 1998, entitled "How Honest Are Dentists?," described the experiences of an investigative reporter who was examined by and received treatment recommendations from 45 dentists randomly chosen from across Canada.2 The results, which to a certain extent impugned dentists' honesty, revealed that the cost of the recommended treatments ranged from $18 to more than $18,000. While some of the difference in cost estimates could doubtless be ascribed to treatment philosophy, much of it resulted from dentists' inability to agree on which teeth had caries. Categorization is the simplest method of measurement, and yet the 10 dentists who examined the reporter in British Columbia and Alberta, for example, differed significantly. Three of the 10 pronounced the reporter caries free, while others identified caries in nine different teeth. The diagnosis of caries, surely one of the most important activities in clinical dentistry, thus appears to be less than exact in practice, and, indeed, developing new methods of caries detection constitutes an active area of dental research.3

As new technologies are applied to dentistry, new measurement techniques are often required to assess their effectiveness. For example, the widespread use of dental implants combined with the growing consumer interest in esthetics led to the need for an index to evaluate the soft tissue around single-tooth implant crowns, and new indices have been developed.4,5 Measurement is an important component of dental research; this chapter reviews the basic concepts.

Operational Definitions

By performing certain operations, we can obtain numbers that describe an object; such description involving the use of numbers is measurement. For a measurement to be well specified, the conditions and procedures that define it must be stated in such explicit detail that anyone could perform the same operations and obtain the same number.


The physicist Bridgman6 argued that every scientific term must be specifiable by a definite testing procedure that provides criteria for its application. These criteria are called operational definitions. One operational definition for the presence of plaque is to stain it by having the subject rinse with a solution of dye; if plaque is present, a specific result is obtained, namely the presence of visible stained material on the tooth. The testing procedure could be performed in various ways, such as with or without rinsing, thus introducing the possibility that loosely held stained material would be eliminated. Thus, the nature of the testing procedure determines what is being measured. Although many techniques claim to measure the same property or characteristic, they actually differ markedly in what they measure. For example, the following substrates have been proposed to measure the clearance of food debris ingested for experimental purposes7:

• Gingerbread biscuits containing copper or iron
• Standard biscuits with ferric oxide as a marker
• Peanut butter with radio-iodinated serum albumin as a marker
• Soda cracker debris stained with iodine

Because soda crackers and peanut butter vary considerably in stickiness, the measurement of retention or clearance of ingested food debris clearly depends very much on the manner in which the test is performed. Different definitions can give rise to markedly different estimates of the prevalence of disease: Two radiographic studies of the prevalence of chronic periodontitis in 13- to 15-year-old adolescents in England and Denmark gave estimates of 0.06%8 and 51.5%,9 respectively. The radical difference appears to be the result of the varying diagnostic radiographic criteria used in the two studies.

Wilson10 proposed that a full-fledged operational definition of a scientific quality has four stages:

1. An intuitive feeling for the quality
2. A method of comparison that allows the determination that A has more of the quality than B
3. A set of standards against which the quality can be compared and by which categories can be devised
4. The interrelation of standards as may be required if there are different ways of measuring the same property

Resolution

Resolution expresses how finely detailed a measurement can be. At first it might seem that resolution is a straightforward concept. For example, it would seem self-evident that a ruler marked in centimeters could provide measurements to the nearest centimeter and no smaller. If we were using the ruler to compare the lengths of two lines and both lines fell between the 7- and 8-cm marks, we would have to conclude that they were the same length. However, if one of the lines was 7.1 cm and the other 7.9 cm, it would be obvious to even a casual observer that the lines differ in length. Typically, scientific observers interpolate between the markings. Indeed, there is a fairly general belief that a trained observer can carry the measurement to one significant figure beyond what is strictly justified by the absolute resolving power of the system. Nevertheless, my own experience contradicts this belief. When I studied analytical chemistry, students were expected to make readings to the nearest 0.01 mL on a burette marked in quantities of 0.1 mL. Neither I nor any student I knew was able to do this with any degree of accuracy, but this inability did not change our instructors' expectations.

While resolution is subject to certain fundamental limitations, some proposed limitations are illusory. Introductory textbooks on cell biology usually report that a light microscope cannot distinguish two points that are separated by a space of less than 0.2 to 0.5 µm. This limit is said to be controlled by the properties of visible light and the mechanism by which images are formed in a microscope. A 35-mm slide of a microscopic image projected at high magnification does not yield any additional information; such magnification is said to be "empty." Before accepting this limitation, however, we need to define what we mean by the term distinguish. In light microscopy, resolution can be discussed in terms of Abbe's specification for the limit of a diffraction-limited microscope; or according to Lord Rayleigh's criterion concerning the spatial relationship of the diffraction patterns; or in relation to the Sparrow limit, which refers to the spatial frequency at which the modulation transfer function becomes zero. Moreover, modern techniques such as video-enhanced microscopy allow objects to be "seen" that theoretically could not be resolved by applying Rayleigh's criterion.11

Taking the changing values in microscopic resolution as a cautionary caveat, let us consider two examples of instrument resolution in dentistry: oral radiology and periodontal probing.


Dental radiographic films have an estimated resolution of 14 to 20 line pairs (lp) per mm, which is equivalent to a pixel size of 25 to 36 µm; images from a charge-coupled device (CCD) with a pixel size of 40 µm have an estimated resolution of 12.5 lp/mm,12 but some high-resolution systems claim 26 lp/mm.13 Conventional periodontal probing has a resolution of 1 mm (or 0.5 mm if interpolation is used); electronic probes can resolve to around 0.1 mm. However, it is possible that other contributors to measurement variability (such as angulation of the probe and probing force for periodontal probes) make a greater contribution to error than that imposed by the resolution of the instrument.
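The pixel-size figures quoted above are consistent with the usual sampling rule that one line pair requires at least two pixels; the following conversion is our own illustration of that rule, not a formula given in the cited studies:

```python
# One line pair needs at least two pixels, so lp/mm = 1000 / (2 * pixel_size_um)
def lp_per_mm(pixel_size_um: float) -> float:
    return 1000.0 / (2.0 * pixel_size_um)

for px in (25, 36, 40):
    print(f"{px} um pixels -> {lp_per_mm(px):.1f} lp/mm")
# 25 um -> 20.0 lp/mm, 36 um -> 13.9 lp/mm, 40 um -> 12.5 lp/mm
```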

Precision

Precision refers to the spread among repeated measurements of the same quantity. Murphy14 notes that the concept of precision is usually applied only to those measurements in which random variation is at least comparable in size with the limit of uncertainty of measurement. For such a measurement, the precision is not indicated by the number of significant figures in the result but rather by multiple determinations of what purports to be the same quantity. The standard deviation (SD) often serves as the criterion for the precision of a result; less frequently, it is indicated by, among others, the variance, the mean deviation, the range, the standard error (SE), or confidence intervals. A measurement is said to be precise if there is only a small spread of numbers around the average value. Yet the definition of precision cited above implies that repeated application of a given measuring technique to a given system will yield results that apply to a single statistical population. Mandel15 argues that such an assumption is not supported by a detailed examination of actual measuring processes, in which populations are nested within populations. This complication is typically ignored.

One reason investigators seek to make their measurements as precise as possible is to increase the possibility of detecting effects. The true score theory of measurement states that every measurement—ie, every observed value—can be considered to consist of two components: (1) the true value and (2) an error:

Observed value = true value + error

The various types of error—random, systematic, the result of observation, etc—are discussed in chapter 13. For now, it should be noted that in the absence of error, the observed value would equal the true value and therefore would be perfectly reliable. Alternatively, if the measurement did not actually measure anything present in the sample, then the observed value would merely reflect the error and therefore be perfectly unreliable. A relationship between the variances (to be discussed later) of these elements can be described as follows:

Variance (observed value) = variance (true value) + variance (error)

For values obtained from groups of individuals, the variance of the true value may be understood as being derived from the real differences between individual members of the group. As a general rule, a clinical measurement of an individual contains two major sources of error: (1) biologic variation that results from variation over time, and (2) analytic variation of the measurement technique. When these components are independent of one another, we do not need to consider the covariance between them, and the total variability of the observed values, expressed as SDtotal, can thus be described as:

SDtotal = √(SD²biologic + SD²analytic)

For example, the pH of saliva in an individual involves the measurement of many factors that vary over time, and each measurement of pH involves the use of a pH meter, a technique that could be affected by analytic error. The section on analysis of errors using analysis of variance (ANOVA) (see chapter 13) presents one approach to sorting out the relative importance of different factors. In designing an investigation, it is important to recognize the futility of worrying about analytic variability if biologic variability is great. Klee16 demonstrates that the reduction in total variability is minimal even when the analytic variability is reduced below 50% of the biologic variability.

Measurements are often reported to have greater precision than the resolving power of the technique. Many students of periodontics are confused to see measurements of probing depth reported in papers to the nearest 0.01 mm, since they know that periodontal probes are marked only to the nearest 1 mm. In such instances, the authors are combining individual measurements to make a statement about a group of measurements. The mean of a series of pocket measurements is a corporate property; it refers to an aggregate, not an individual measurement.
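Klee's point follows from the quadrature addition of independent error components; a short numeric illustration (the values are invented for demonstration):

```python
import math

def sd_total(sd_bio: float, sd_analytic: float) -> float:
    # Independent error components add in quadrature
    return math.sqrt(sd_bio**2 + sd_analytic**2)

sd_bio = 1.0
for frac in (1.0, 0.5, 0.25, 0.1):   # analytic SD as a fraction of biologic SD
    print(f"analytic = {frac:>4} x biologic -> total = {sd_total(sd_bio, frac * sd_bio):.3f}")
# 1.414, 1.118, 1.031, 1.005: once the analytic SD is below ~50% of the
# biologic SD, further reduction barely lowers the total variability
```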


Accuracy

A measurement is accurate if its performance on average is close to the true value to be determined. The criterion for accuracy is the absence of bias. Among other possibilities, bias can be contributed by the observer, the subject, the instrument, and the environment. Part of the skill of being a competent investigator involves determining how such biases can be detected and removed. When assessing the accuracy of a measurement, it is often the case that the true value cannot be determined. Some measurements can be related indirectly to a reference value; for example, measurements of length made with a yardstick could be expressed in terms of the international standard (eg, the wavelength of the orange-red line of krypton-86). Some reference values can have assigned values arrived at by common agreement of a group of experts.15 In biology and medicine, the problem of obtaining reference values is particularly acute because the measuring technique often significantly affects the value. A good example is the technique used to take blood pressure: Because an apprehensive subject can give an inaccurate value, there is said to be an art to taking blood pressure measurements so that the subject relaxes.

In the physical sciences, several approaches have been taken for estimating accuracy. In one, investigators calculate the contribution of all systematic errors that could affect the measuring process to estimate the amount of error in their measurements. This approach, based on scientific judgment, requires experience of the behavior and properties of the materials and the instruments as well as access to previous measurement data, calibration data, and so forth. Another technique is to compare the results with those obtained by different and independent measuring processes, in the plausible belief that our confidence in the value is increased if several measuring methods unrelated to each other yield the same value.11 For well-established techniques, uncertainty estimates would be based on standard statistical techniques such as estimating parameters of a curve, ANOVA, and SDs. ISO/TS 21748:2005 defines accuracy as "the closeness of agreement between a test result or measurement and the true value." In this definition, accuracy is the property of a single measurement result, and it includes the effects of both precision and bias. Bias can be measured using a suitable reference material, such as a reference material certified by a standards organization, a spiked test sample, or a reference method.

Validity

Validity expresses the relationship between what a test is intended to measure and what it actually measures. Validity can also be thought of as the extent to which a measure predicts something important about the object measured. For example, there are a number of methods for estimating the amount of plaque on teeth, some of which involve staining the plaque with a dye. However, because the dyes most commonly used (such as basic fuchsin) also stain the overlying pellicle, the measurements of stained areas correlate poorly with the weight of plaque present. Thus, the validity of the staining procedures as an estimate of the amount of plaque present is dubious. In this instance, the disparity between intended and actual measures is related to the physical-chemical aspects of the measurement. But the validity of some techniques of plaque measurement could also be questioned on clinical grounds. From the standpoint of the initiation of periodontal disease, the area of greatest concern is the deposition of plaque close to the gingival margins, and yet the various methods of scoring plaque do not always emphasize this fact. For this reason, Schick and Ash modified the plaque-measuring portion of the Ramfjord Periodontal Disease Index (PDI) to restrict the scoring to the gingival half of the tooth surfaces.17 Another example from periodontics is the validity of attachment-level measurement, the question being how reliably the coronal level of the connective tissue attachment can be determined by clinical probing.

In the discussion of accuracy, it was noted that comparing the same property using different techniques is sometimes of value. Predictive validity refers to the estimated value of a new measuring technique relative to that of an existing instrument or technique that is already validated or highly accepted—ie, a gold standard. For continuous data, a correlation coefficient for the standard and the new measurement can be calculated. For example, the McGill Pain Questionnaire (MPQ) is a standard reference for the measurement of pain. However, some investigators prefer to use a less cumbersome and more convenient method, such as a visual analogue scale. If an experiment on pain is performed and a visual analogue scale is used as the method of measuring pain, the results would be accepted more readily if a high correlation could be shown between the scores recorded by the new test and those from the MPQ.


Fig 12-1 Frequency distributions of IQ for equal-sized populations of blacks and whites. (Adapted with permission from Herrnstein and Murray.20)

The section on specificity and sensitivity in chapter 15 shows the calculations used to evaluate the diagnostic performance of new tests relative to a gold standard (ie, the best available test to determine if disease is present or absent). This can be thought of as an example of concurrent validity, ie, a measure's ability to distinguish between groups that it should theoretically be able to distinguish between.18

Research in education has necessarily had to consider the validity of tests measuring student performance. Frey19 has outlined several measures of test validity:

• Face validity. At first glance, the test must appear to measure what it is intended to measure. This is simply a matter of human judgment and yields no numeric data. A test on oral histology, for example, might be expected to include questions on odontoblasts, ameloblasts, fibroblasts, osteoblasts, and mucosal epithelium, among others.
• Content-based validity. This measure concerns how well the test questions cover a certain well-defined domain of knowledge. Validity at this level generally involves some organized method of selecting and forming questions. In a test of knowledge of oral biology, professors are normally required to include various components, such as the histology, biochemistry, microbiology, and neurophysiology of the oral cavity.
• Construct-based validity. This measures how well the score on the test represents the characteristic that it is designed to measure. Trochim18 defines it as the degree to which inferences can be made from the operationalization in the measurement to the theoretical constructs on which the operationalizations were based. In measuring intelligence quotient (IQ), for example, the operationalization might be a question like the following: What number follows next in the series 1, 2, 3, 5, 8, 13 . . . ? In this example, the theoretical construct would be that one aspect of intelligence is the recognition of numeric patterns.
• Consequences-based validity. This measure focuses on the effects of taking the test on those taking it and whether the test is biased against certain groups. For example, there are data reporting that the distribution of IQ scores is such that black Americans score significantly lower than white Americans (Fig 12-1); in other words, the mean IQ score for black people is lower than the mean IQ score for white people. In the controversial book The Bell Curve, Herrnstein and Murray20 used these data as well as other assumptions or data—including the relationship between economic success and IQ and antisocial behavior and IQ—to recommend a broad range of social policies.

The publication of Herrnstein and Murray's book resulted in an abundance of critical reviews, many of which criticize the quality of the data as well as interpretations of the bell curve (see, for example, Jacoby et al21). Several critiques are based on the validity of the IQ test; ie, does it really measure intelligence? For example, one explanation of the difference is that black people typically had a lower socioeconomic status (SES) than white people, and it is possible that IQ tests were measuring social background rather than intelligence. This could be tested by plotting the SES of parents versus the IQ of their children to see if nurture rather than nature had some role in IQ scores. In fact, IQ scores go up with SES . . . but for both black people and white people. So although it does not explain the IQ difference between black people and white people, the finding does indicate that IQ measurement may include some component of home environment. Conversely, it could be argued by proponents of the IQ test that the higher-SES parents were in fact smarter than the lower-SES parents and that their native intelligence was passed on to their children.

Another approach to criticizing IQ test validity has been to look closely at the construction of the tests. Richardson22 suggests that IQ testing is primarily a reformatting exercise in which ranks in one format (teachers' ratings) are converted into ranks in another format (IQ scores). This reformatting is accomplished by selection from the pool of items used in the construction of the test. Moreover, the test items are also selected so that the distribution of results mimics that of physical characteristics, such as height or strength; that is, the results are distributed normally, forming a bell curve.

There are also concerns about the conditions surrounding the application of tests that purport to measure intellectual abilities in persons whose race or gender is stereotypically associated with low performance in the particular abilities being measured. For example, when African-American students and European-American students were given a test described as "diagnostic of intellectual ability," the African-American students' performance was found to be significantly lower than that of the European-American students. However, if the test was described as "a laboratory-based problem-solving task that was not diagnostic of ability," the performance of the African-American students improved. The authors concluded that framing the test as one of intellectual ability caused the African-American students to worry that their performance would confirm a negative stereotype of their group, and their performance suffered. This phenomenon has been labeled stereotype threat.23 Stereotype threat has also been invoked to explain the achievement gap in mathematics between men and women. The strength of the effect has been questioned, and there is suggestive evidence that the effects have been inflated by publication bias.24 In any case, if researchers adopted the consequences-based criteria for validity, IQ scores would not be considered valid, because the test appears to discriminate between racial groups.

Types of Scales

Scales are required for making comparisons between measurements. The hierarchy of measurement levels ranges from nominal scales that categorize observations, through ordinal and interval scales, to ratio scales that have an absolute zero and a fixed interval of measurement. The information content is relatively low for nominal scale data in comparison with ratio scale data, and different statistical approaches are required to make comparisons. Typically, the statistical approaches for nominal scale data are less sensitive than those used for ratio scale data. Because each level of scale adds some new information and statistical sensitivity, it is generally preferable to use the highest level of measurement possible and not to convert data on a higher-level scale to a lower one. For example, there would be a loss of sensitivity if we were to convert ratio scale data to rank values and then compare groups using a nonparametric test.

Nominal (or qualitative) scale

A nominal scale is used primarily for systematic classification (eg, gender, race, hair color). While a number may be assigned to each class to simplify recordkeeping, it would be chosen arbitrarily and would carry no value. A nominal scale can be used to sort objects or individuals into groups according to gender or hair color (red, brunette, blonde), for example. The key to sorting by nominal scales is to construct well-defined categories. In most instances, sorting people by gender is fairly clear-cut; a person is either male or female. Sorting by hair color is more difficult. Would you place a person with auburn hair in the brown- or red-haired group? Should peroxide blondes be classified with natural blondes, and what of people who have their hair streaked? A good nominal scale has a sufficient number of well-defined categories to prevent objects from being put into inappropriate groups. Ideally, independent observers will sort the objects into the same groups; however, this ideal is difficult to achieve. For some conditions, such as premalignancies, the categories are not well defined, leading to widespread disagreement among oral pathologists when diagnosing such conditions.

Nominal scale data are analyzed using nonparametric tests. In contrast with parametric tests (eg, the t test), which usually assume the data are sampled from populations that follow a normal (Gaussian) distribution, nonparametric tests are used to analyze nominal scale data (and occasionally interval or ratio scale data) that depart from the expected distributions. (Distributions of data are discussed in more detail in chapters 10 and 11.) For present purposes, it should be noted that some nonparametric tests rely on binomial or other non-Gaussian distributions; perhaps shockingly, when the number in a sample reaches a certain threshold (eg, > 30), a normal approximation to a binomial distribution might be used. The tests used most often for nominal scale data and the criteria that favor their use are presented in Table 12-1.
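As a minimal illustration of the kind of test listed in Table 12-1 (below), the following sketch applies a χ2 test to a hypothetical 2 × 2 contingency table; the counts are invented, and scipy is assumed to be available:

```python
from scipy.stats import chi2_contingency

# Hypothetical 2 x 2 table of counts: rows = treatment/control,
# columns = caries/no caries (numbers invented for illustration)
table = [[12, 28],
         [22, 18]]

chi2, p, df, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {df}, P = {p:.3f}")
print("Smallest expected cell count:", expected.min())  # should not be < 5
```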

Table 12-1 | Nonparametric tests commonly used with nominal scale data

Name of test | No. of groups | Matched/unmatched | No. of cells | Comment
χ2 | 2 | Unmatched | Total should not be < 20; expected (or theoretical) value in all cells should not be < 5 | Assigns approximate P value; makes calculation easy. Values in cells should be counts (not %). Cells must be mutually exclusive and exhaustive. Can be used with observed data and calculated expected values, or a theoretical model, as a "goodness of fit" test
Fisher exact | 2 | Unmatched | Used when expected value in any cell is < 5 |
Kruskal-Wallis one-way ANOVA | >2 | Unmatched | | Compares means of more than two samples when (a) only ordinal data are available, or (b) when a one-way ANOVA would be used for parametric data but the underlying distributions are far from normal
Friedman two-way ANOVA | >2 | Matched | | Analogous to two-way ANOVA for parametric data. Independent variables (treatments) are the columns; rows are matched groups or repeated measures on individuals

Ordinal scale

An ordinal scale is used to analyze data when increasing amounts of the measured characteristic are associated with higher values (Table 12-2). Hence, an ordinal scale both sorts and orders subjects or objects. In theory, ordinal scale data should not be averaged. For example, some components of the various indices used to assess periodontal disease rely on ordinal scale data. However, because of how it is defined, a plaque score of 3 on the Ramfjord PDI cannot be added to a plaque score of 1 to obtain an average of 2: A score of 1 means plaque is present on some but not all surfaces, whereas a score of 3 means plaque extends overall and covers more than half of the surfaces. A score of 3 does not indicate three times as much plaque as a score of 1. Therefore, in theory at least, certain common statistical tests that calculate means and variances (ie, parametric tests) should not be used to analyze such data; statistical comparisons of groups rated by ordinal scales should instead be performed with nonparametric techniques, and sophisticated techniques for analyzing data from periodontal examinations have been proposed.25

Nevertheless, despite considerable doubt about equal intervals between the categories, it is common practice to summarize subjective ordinal scale data, calculate means, and apply the usual parametric tests. Fleiss et al26 have investigated the degree to which the distribution of values of several periodontal indices in a sample of patients approximates a normal distribution, as well as the capacity of these indices to detect treatment differences. They concluded that the distribution of whole-mouth means was too skewed to warrant the application of parametric statistical analysis. However, they found that the square root of the whole-mouth means appeared to have a near-normal distribution and the power to detect treatment effects. They recommend using the square root of the whole-mouth mean as the preferred transformation in clinical trials of antigingivitis agents. Similarly, Labovitz27 has shown that in appropriate circumstances, certain ordinal statistics can be exchanged with interval statistics. To achieve this end, he recommends assigning as limited a scoring system as possible based on available evidence and using all available categories rather than collapsing them into just two or three.

In using ordinal scales, the investigator assumes that the measured objects can be placed in some type of order, although this is not always possible. The so-called Bo Derek scale of rating attractiveness (10 is a perfect score) assumes that beauty can be ranked. However, Somerset Maugham tells us that beauty is in the eye of the beholder. If Maugham is right, the Bo Derek scores would be unreliable and uninterpretable. A second major consideration in using ordinal scales is the number of categories to construct. If the number of categories is small, the scale will be crude and unable to detect subtle differences. Conversely, if the number of categories is large but each one is not well defined, there will be poor agreement between observers.

Table 12-3 | Some statistical tests commonly used with interval and ratio scale data

Name of test | No. of samples | Matched/unmatched | Comment
z score | 1 | Unmatched | Examines differences between a sample and a population
t test | 2 | Unmatched | Compares two means. Assumes both samples are randomly derived from normal populations with equal variances but is considered robust enough to withstand considerable departures from these assumptions
Paired t test | 2 | Matched | Assumes that differences in paired values follow a normal distribution; if paired values correlate, a paired test will be more powerful than two independent-sample tests
One-way ANOVA | >2 | Unmatched | Determines whether significant differences exist between means of the samples; subsequent tests are needed to find differences between given pairs of means. Assumes normal distribution and equal variances for the samples but is fairly robust to departures
Two-factor ANOVA, randomized block | >2 | Unmatched | Similar to pairing, it enables assessment of interaction between factors as well as primary effects of factors. Assumes the effect of block is a fixed factor

Interval scale

An interval scale sorts and orders objects in the same manner as an ordinal scale, but in addition it uses a fixed unit of measurement that corresponds to some fixed quantity of a particular characteristic (eg, °C or °F) (Table 12-3). The problem associated with interval scales is linearity. It might be said that happiness is related to income; the more you earn, the happier you are. However, Fortune magazine reported that the relationship is not linear; rather, happiness increases as the cube root of income. Thus, if you double your income, you will be only 26% happier, and you would need to earn eight times your present income to be twice as happy as you are now.28 (This explains why some dentists work only 4 days a week. The 25% increase in work between a 4-day and a 5-day week would yield only an 8% increase in happiness.) Consequently, though it has a fixed unit of measurement, income is not a valid quality for measuring happiness on an interval scale.

Ratio scale

A ratio scale is an interval scale with a known point of origin (zero point) (see Table 12-3). As the name implies, a ratio scale allows us to compare values directly. For example, the Kelvin scale is a ratio scale because it is zeroed at absolute zero, the temperature at which there is no molecular motion. Thus, 200 K is twice as hot as 100 K, and kelvins are used in the gas law equations. In contrast, Celsius is an interval scale and could not be used directly in the gas law equations; 20°C is not twice as hot as 10°C. Periodontal probing indices are a ratio scale because some pockets measure 0 mm.


A problem associated with ratio scales is that some qualities have a value of zero when some amount of the quality is present. Suppose that a 10-question quiz on the timing of tooth development was given to a graduating dental class. Because 3 years would have elapsed since the class studied oral histology, it is possible, indeed likely, that some individuals would score zero. Yet it is highly unlikely that those individuals know nothing about oral histology. If some other questions were asked, the individuals who scored zero might answer at least one or two questions correctly. Moreover, we could not be sure that individuals who scored 10 on the test knew twice as much as those who scored 5. Thus, although examination results look like ratio scale data, in reality they are not.

It should be evident that when ordinal measurements are made, the results can be expressed only in terms of greater than or less than. More precise statements can be made on an interval scale, and effects can be compared via differences. For example, the difference between a bathwater temperature of 50°C and one of 30°C is twice as great as the difference that exists between two baths that measure 20°C and 10°C, respectively. Ratio scale measurements can be compared directly.

Units

The units of measurement—that is, the size of the scale intervals—are arbitrary. Historically, the arbitrary determination of units caused major problems in trade because of the variability of measures used in different towns and countries. One proposed solution to this problem was to devise universal measures based on the perfection of nature. The length of the meter was intended to equal one ten-millionth of the distance from the North Pole to the equator as measured along a segment of a meridian (in France), so established as to exclude all that is arbitrary29 (at least in the eyes of its French proponents). Nevertheless, convenience dictates that the size of some units (and hence the scale intervals) should remain arbitrary. For example, the temperature interval of 1°C differs from that of 1°F. Units can be defined only if the set of standards used is additive (as with a 1-m ruler, which comprises 100 cm).

How unit size affects the conclusion

Obviously, a unit that is too coarse may not be capable of detecting differences. A ruler graduated only in 1-m lengths would not be a practical tool for measuring the height of individuals. Pearce30 has recommended that for the proper use of ANOVA techniques, the unit of measurement should not exceed one-tenth of the total range encountered in the experiment. Wilson31 states that it can be shown mathematically that substantial gains in efficiency can be made by reducing the scale interval to a fraction, perhaps a third or a fifth, of the SD of the population being measured. Practical considerations may also apply. In periodontics, Glavind and Löe32 measured all surfaces of 1,530 teeth and showed that the method error for periodontal probing was less than 0.5 mm. Because this amount of error was not considered clinically significant, the 1-mm markings on the probe were considered appropriate.

Strange as it might seem, it is possible to obtain data that are more precise than the units of the scale originally used in the measurements. It is not unusual to see reports in the periodontics literature of probing depths or attachment levels expressed to the nearest 0.1 mm. Because periodontal probes are marked only at 1-mm intervals, this might seem impossible. However, you can demonstrate this effect yourself by making two sharp pencil marks 7.5 cm apart and measuring the distance to the nearest centimeter with a ruler marked only in centimeters (ie, with no finer markings). By placing the ruler randomly between the points (ie, do not always start the measurement on zero or any other predefined point), sometimes you will obtain values of 8 cm and other times 7 cm. By averaging many values, you could not only obtain a value closer to the true value of 7.5 cm but also have an SE of less than 0.1 cm, which would be less than one-tenth of the markings on the ruler! The actual limit to this approach is set by the systematic error in the markings of the ruler; that is, by measuring the distance millions of times, you could still not gain accuracy in the range of micrometers.31
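The pencil-mark demonstration can also be simulated. In this minimal sketch (our own construction), a 7.5-cm distance measured with a whole-centimeter ruler placed at random offsets averages out to approximately the true value:

```python
import random

# Simulate measuring a 7.5-cm distance with a ruler marked only in whole cm,
# placing the ruler at a random offset each time and rounding to the nearest mark
true_length = 7.5
n = 10_000
readings = []
for _ in range(n):
    offset = random.random()                       # random ruler placement
    reading = round(true_length + offset) - round(offset)
    readings.append(reading)                       # each reading is 7 or 8

mean = sum(readings) / n
print(f"mean of {n} readings: {mean:.3f} cm")      # ~7.5, despite 1-cm resolution
```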

Wrong units

Difficulties also can arise from use of the wrong scale of measurement. In studying the effects of nutrients on growth, for instance, it has been found that the logarithm of weight produces a better variate than the weight itself; this is because a change in nutrition does not immediately make an animal larger or smaller but may well lead to a change in its growth rate. Hence, some difficulties can be met simply by converting the data into a more useful form. This topic is discussed in greater detail by Pearce30 and Zar.33

Ratios

It is common for some values to be reported as the ratio between two measurements. Income is expressed as dollars/year, ATPase activity as moles Pi/hour/mg protein, collagen production as moles/hour per 10 cells, and so forth. In considering such complex values, several factors must be taken into account:

1. In dental journals, it is not unusual for authors to report the percentage of successes. However, this can lead to pseudo-precision: if the denominator is small, the precision of the percentage is low. For example, a paper on reinjection of local anesthetic into the periodontal ligament concluded that the injection should be done so that a strong back pressure is noted.34 Among the data offered to support this view was the finding that if strong back pressure is not observed, anesthesia is achieved in only 7 of 22 cases, or 32% of the time. Using the table in appendix 2, taken from Mainland,35 we can calculate that for 7 successes out of 22 attempts, the 95% confidence limits for this percentage are 14% to 55%. Thus, it is possible that the majority of patients would experience anesthesia even when a strong back pressure was not obtained. Because percentages with low denominators lack precision, some clinical medical journals publish percentages only for fractions with denominators greater than 50. As far as I can determine, however, most dental journals have not adopted this practice.
2. The accuracy (and/or precision) of the ratio is influenced by both the numerator and the denominator. This means that the ratio will be less accurate than either of the measurements used to construct it.
3. The use of ratios often carries the assumption that a linear relationship exists among the components in the calculation. If this is not so, the comparison may be meaningless. For example, the amount of enzyme in a solution is often measured by incubating the solution with a known amount of substrate for a known time and measuring the amount of product. However, the amount of product is linear with respect to the amount of enzyme only for a limited range of enzyme concentrations. If sufficient enzyme to react with all of the substrate is already present, the addition of more enzyme will not yield more product. Thus, to compare the relative amount of enzyme present in two solutions, both solutions would have to be assayed in the linear portion of the activity versus enzyme-concentration curve.
4. In appropriate circumstances, the distribution of the ratio of two Gaussian-distributed variables (discussed in chapter 11) can be bimodal. This may lead an investigator to conclude that a population is heterogeneous when it is not.36
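The 14% to 55% limits quoted in item 1 can be reproduced with an exact binomial (Clopper-Pearson) calculation, one standard way of computing such limits (the original used Mainland's printed table); a minimal sketch assuming scipy:

```python
from scipy.stats import beta

# Exact (Clopper-Pearson) 95% CI for 7 successes in 22 attempts
k, n, alpha = 7, 22, 0.05
lower = beta.ppf(alpha / 2, k, n - k + 1)
upper = beta.ppf(1 - alpha / 2, k + 1, n - k)
print(f"{k}/{n} = {k/n:.0%}; 95% CI: {lower:.0%} to {upper:.0%}")  # ~14% to ~55%
```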

References

1. Blackman VH. Botanical retrospect. J Exp Botany 1956;7:ix.
2. MacDonald J. How honest are dentists? Reader's Digest Canada 1998;(Sept).
3. Huysmans MC, Longbottom C. The challenges of validating diagnostic methods and selecting appropriate gold standards. J Dent Res 2004;83(spec no.):C48–C52.
4. Meijer HJ, Stellingsma K, Meijndert L, Raghoebar GM. A new index for rating aesthetics of implant-supported single crowns and adjacent soft tissues. The Implant Crown Aesthetic Index. Clin Oral Implants Res 2005;16:645–649.
5. Fürhauser R, Florescu D, Benesch T, Haas R, Mailath G, Watzek G. Evaluation of soft tissue around single-tooth implant crowns: The pink esthetic score. Clin Oral Implants Res 2005;16:639–644.
6. Bridgman PW. The Logic of Modern Physics. New York: Macmillan, 1927.
7. Mandel ID. Indices for measurement of soft accumulations in clinical studies of oral hygiene and periodontal disease. J Periodontal Res 1974;14:7–30.
8. Hull PS, Hillam DG, Beal JF. A radiographic study of the prevalence of chronic periodontitis in 14-year-old English schoolchildren. J Clin Periodontol 1975;2:203–210.
9. Blankenstein R, Murray JJ, Lind OP. Prevalence of chronic periodontitis in 13–15-year-old children: A radiographic study. J Clin Periodontol 1978;5:285–292.
10. Wilson EB. An Introduction to Scientific Research. New York: McGraw Hill, 1952:164.
11. Heintzmann R, Ficz G. Breaking the resolution limit in light microscopy. Brief Funct Genomic Proteomic 2006;5:289–301.
12. Folk RB, Thorpe JR, McClanahan SB, Johnson JD, Strother JM. Comparison of two different direct digital radiography systems for the ability to detect artificially prepared periapical lesions. J Endod 2005;31:304–306.
13. Berkhout WE, Verhiej JG, Syriopoulos K, Li G, Sanderink GC, van der Stelt PF. Detection of proximal caries with high-resolution and standard resolution digital radiographic systems. Dentomaxillofac Radiol 2007;36:204–210.
14. Murphy EA. A Companion to Medical Statistics. Baltimore: Johns Hopkins University, 1985:192.


15. Mandel J. The Statistical Analysis of Experimental Data. New York: Dover, 1964:103–125.
16. Klee GG. Toward more effective use of laboratory results in differential diagnosis. In: Hamburger HA, Batsakis JG (eds). Clinical Laboratory Annual. New York: Appleton-Century-Crofts, 1982:119.
17. Ramfjord SP. The Periodontal Disease Index (PDI). J Periodontol 1967;38(suppl):602–610.
18. Trochim WMK. The Research Methods Knowledge Base, ed 2. Cincinnati: Atomic Dog, 2001:68.
19. Frey B. Statistics Hacks. Sebastopol, CA: O'Reilly, 2006.
20. Herrnstein RJ, Murray C. The Bell Curve: Intelligence and Class Structure in American Life. New York: Free Press, 1994.
21. Jacoby R, Glauberman N. The Bell Curve Debate: History, Documents, Opinions. New York: Random House, 1995.
22. Richardson K. The Making of Intelligence. London: Weidenfeld and Nicolson, 1999:35.
23. Steele CM, Aronson J. Stereotype threat and the intellectual test performance of African Americans. J Pers Soc Psychol 1995;69:797–811.
24. Flore PC, Wicherts JM. Does stereotype threat influence performance of girls in stereotyped domains? A meta-analysis. J Sch Psychol 2015;53:25–44.
25. Zimmerman S, Johnston DA. Non-parametric tests and ridits. J Periodontal Res 1974;14(suppl):193.
26. Fleiss JL, Park MH, Bollmer BW, Lehnhoff RW, Chilton NW. Statistical transformations of indices of gingivitis measured non-invasively. J Clin Periodontol 1985;12:750–755.
27. Labovitz S. The assignment of numbers to rank order categories. Am Sociol Rev 1970;35:515–525.
28. Seligman D. Keeping up. Fortune International 1988;229.
29. Alder K. The Measure of All Things: The Seven-Year Odyssey and Hidden Error that Transformed the World. New York: Free Press, 2002:89–93.
30. Pearce SC. Biological Statistics: An Introduction. New York: McGraw Hill, 1965:56.
31. Wilson EB. An Introduction to Scientific Research. New York: McGraw Hill, 1952:251–254.
32. Glavind L, Löe H. Errors in the clinical assessment of periodontal destruction. J Periodontal Res 1967;2:180–184.
33. Zar JH. Biostatistical Analysis. Englewood Cliffs, NJ: Prentice Hall, 1974:236–242.
34. Walton RE, Abbot BJ. Periodontal ligament injection: A clinical evaluation. J Am Dent Assoc 1981;103:571–575.
35. Mainland D. Elementary Medical Statistics. Philadelphia: Saunders, 1963:358–363.
36. Murphy EA. A Companion to Medical Statistics. Baltimore: Johns Hopkins University, 1985:204–209.


13 Errors of Measurement

"The Normal Law of Error stands out in the experience of mankind as one of the broadest generalizations of natural philosophy. It serves as the guiding instrument in researches in the physical and social sciences and in medicine, agriculture, and engineering. It is an indispensable tool for the analysis and the interpretation of the basic data obtained by observation and experiment."
— William J. Youden

In experimental biologic research, error can be viewed as the result of any factor that affects the results in a manner that is not precisely known to the experimenter. Even in disciplines as precise as physics or chemistry, in which results are the product of physical measurement, the numeric value obtained depends on the accuracy of the experiment that measured it. There is no such thing as the exact value of a physical constant.2 Given this uncertainty, results should be presented as a range of values that will fit the experimental data. In physical science this is sometimes done implicitly, by conforming to the rules of significant figures, or explicitly, by a calculation of the probable sources of error and their contribution to the total error in the experiment. In biologic science, the data may often be expressed as a mean plus an estimate of variability, which is determined by statistical techniques. Statistics, as the old joke goes, means never having to say you're certain, and one common means of presenting data is to use a confidence interval (CI). Whatever the means used, the intent is the same: to give the reader some estimate of the probable accuracy and precision of the data. The following sections give some of the background information needed to assess errors of measurement.



"Everybody firmly believes in it [the law of errors] because mathematicians imagine that it is a fact of observation, and observers that it is a theorem of mathematics."
— Henri Poincaré1


Fig 13-1 (a to d) Analysis of precision and accuracy of 7-mm markings on various periodontal probes. [Histograms of frequency vs measured length in millimeters, contrasting precise vs imprecise distributions and accurate (no systematic error) vs inaccurate (systematic error) distributions.]

Precision Versus Accuracy

There is a distinction made between precision and accuracy. Precision refers to the dispersion of values around the measure of central tendency used; eg, a measured value that had a small standard deviation (SD) or CI would be considered precise. In contrast, a measurement is said to be accurate if the result, expressed as a range of possible values, includes the "true" value. For example, Winter3 used a Boley gauge, which was accurate to 0.1 mm, to examine the markings on various manufacturers' periodontal probes (Fig 13-1). He found that Williams probes manufactured by Hu-Friedy under an old process were both imprecise and inaccurate. At the 7.0-mm mark of the probes, the true length ranged from 6.8 to 7.9 mm, with the largest number of probes in the range of 7.4 to 7.6 mm. The process used by another manufacturer yielded a product that was inaccurate but precise: 10 of the 13 probes examined fell in the 7.4 to 7.6 mm group and the others in the 7.1 to 7.3 mm group. Finally, the process used in the new Williams probes manufactured by Hu-Friedy produced accurate and precise probes. Thirty-five of the 42 examined were accurate at the 7.0-mm mark, and the total range was only 6.8 to 7.3 mm.

When there is little random or systematic error, the results are clustered around the true mean, eg, the data on the new Williams probe (see Fig 13-1a). The presence of random error increases the variability of response around the true mean, thus reducing the precision (see Fig 13-1b). Systematic error displaces the clustered result values away from the mean, making them inaccurate (see Fig 13-1c), and when both errors are present, confusion reigns, for the results are neither precise nor accurate (eg, the data on the old Williams probe; see Fig 13-1d).
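The distinction is easy to make concrete in a few lines of code. The sketch below is an added illustration, not Winter's raw data; the measurement values are invented. Bias from the true 7-mm length indicates inaccuracy, and the SD indicates imprecision.

# A minimal sketch of the precision/accuracy distinction using
# hypothetical 7-mm-mark measurements (invented values).
import statistics

TRUE_LENGTH = 7.0  # nominal 7-mm mark

batches = {
    "old process (imprecise, inaccurate)": [6.8, 7.2, 7.5, 7.6, 7.9, 7.4],
    "other maker (precise, inaccurate)":   [7.4, 7.5, 7.5, 7.6, 7.2, 7.5],
    "new process (precise, accurate)":     [6.9, 7.0, 7.0, 7.1, 7.0, 7.2],
}

for name, values in batches.items():
    bias = statistics.mean(values) - TRUE_LENGTH  # systematic error -> inaccuracy
    sd = statistics.stdev(values)                 # dispersion -> imprecision
    print(f"{name}: bias = {bias:+.2f} mm, SD = {sd:.2f} mm")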

Types of Error

Error can be classed as being determinate or indeterminate. These will be discussed separately, but first the inevitability of some error in every experiment will be emphasized.


Error as the result of observation

In any experiment there is always the possibility that the method chosen for observing the object may in some way alter it. In extreme cases, when the method of observation grossly alters the object, the result is called an artifact. The indeterminacy principle is perhaps the most sophisticated example of error as the result of observation. The indeterminacy principle was first postulated in quantum mechanics by Heisenberg, who noted that for the subatomic particle, the electron, it was impossible to know accurately both its position and its momentum at the same time. To prove this, he showed that any conceivable instrument that could accurately measure the electron's position would affect the electron's momentum, and vice versa. The indeterminacy principle has been the center of a major methodologic controversy about whether uncertainty is inherent in nature or results from the imperfect state of our knowledge and methods.

But you do not have to look as far afield as quantum mechanics for examples of observation affecting the quantity being studied. To take an example from dentistry, Shaw and Murray4 state:

Some of the differences in diagnosis observed between the first and second examinations may be due to systematic changes introduced by the examination method. The use of a probe to test for the presence or the condition of the gingiva inevitably alters the environment and can affect the scores of a second examination.

Similarly, the use of disclosing dyes to measure the deposition of plaque has been questioned because it is possible that the dyes inhibit plaque growth. Although physicists have debated the indeterminacy principle at great length, biologists have not as a general rule concerned themselves too much with this sort of issue. An exception would be Hillman,5 who has submitted some of the common techniques used in biochemistry to a critical analysis. This analysis consisted of the following:

1. Describing all the steps in a procedure
2. Examining the agents in each step that might influence the final answer
3. Identifying the assumptions necessarily implied in the use of the procedures
4. Discussing their validity

5. Suggesting control experiments to analyze quantitatively how each step affects the conclusions of experiments

Hillman6 concluded, "At the moment biochemistry is in a state of uncertainty because elementary control experiments for complex procedures have never been done." I have never seen Hillman's theoretical analysis of the problems inherent in standard biochemical techniques refuted, but it has made little impression on biochemists. It seems that biochemists adopt a more pragmatic and less theoretical approach: they simply use different means or methods of observing the same object and determine whether the different techniques corroborate each other. For example, early electron microscopists tried different ways of fixing, dehydrating, and staining biologic material. Their argument was that if similar structures could be seen even though different procedures had been used, it was likely that the structures existed prior to the treatments. Although each procedure might produce some kind of artifact, it was unlikely that the different procedures would produce the same artifact. Hence the ordered structures observed were considered to be real. Similarly, electron microscopy and histochemistry have confirmed some of the inferences made by biochemists on the basis of biochemical study of isolated subcellular fractions.

Determinate errors

Determinate error results when one part of the observing system or the design of the experiment is inaccurate or defective. In principle, determinate errors can be removed. I divide determinate error into several classes.

Systematic

Systematic errors are the same for each observation. Measurements made with the old Hu-Friedy Williams periodontal probes described in Winter's study would have a systematic error. Systematic error can be the result of chemical reagents that contain impurities, or even the choice of method; some methods consistently overvalue or undervalue the quantity measured. In clinical studies, systematic error can be introduced by any of the biases that cause the study population to be selected or examined nonrandomly in the target population. For example, partial-mouth scores are often used to estimate the prevalence of periodontal disease, but such estimates systematically underestimate the prevalence of the disease.7

Fig 13-2 Distribution of numbers chosen for Lotto 6/49. [Bar chart from the figure: times selected (in thousands) on the ordinate vs numbers chosen (1 to 49) on the abscissa.]

Personal

Some errors are attributable to the idiosyncrasies of the person making the measurement. Wilson8 states that almost everyone displays number prejudices, which markedly influence the frequency with which the different digits occur in the estimation of tenths of a division on a scale. The distribution of numbers chosen for Lotto 6/49 (a Canadian national lottery game in which one tries to choose the 6 numbers that are drawn out of 49 available) provides evidence for number preferences. The numbers 7 and 25 are popular, while few people select 10, 20, 30, 39, or 40 (Fig 13-2). Such is the magnitude of these personal number preferences that statisticians can formulate betting (ie, number selection) strategies that guarantee success over the long (well, very long) run. Personal errors can also enter into techniques where, on the surface at least, there would appear to be little chance of idiosyncrasies influencing the results. For example, in counting blood cells with a hemocytometer, it is not unusual for there to be consistent differences between operators.

There is also the possibility of bias, either conscious or subconscious, influencing the results. Bias can be demonstrated in another example from hematology. In red blood cell counts with a hemocytometer, the random standard error of observation under the conditions usually used has been estimated at 8% to 10% of the observed count. Laboratory technicians used to be told erroneously during their training that the error is much less (≈1.4%), and this value was used to set a standard of agreement that was required to be reached before the worker’s results were considered reliable. This resulted in the setting of an impossible standard of precision. Because the standard was one that could not be reached with accurate counting, the technicians learned to count very rapidly and to make unconscious adjustments to ensure that all counts agreed with the first.9 In this example, it was possible to analyze data to prove the existence of bias, but in subjective measurements such as those often performed in a clinical setting, the problem of unconscious bias is even greater. Modern experimental designs, such as the double-blind procedure, have been utilized in an attempt to minimize the contribution of personal bias, but in some instances even elaborate precautions are not successful.


Assignable causes and blunders

An experimenter may overlook relevant variables. If the values of these variables change, the results will be affected. For example, if you were investigating the relationship between pressure and volume using gases at different temperatures, the results might fluctuate wildly. If you learned later that temperature affected the relationship between pressure and volume, the experiment could be repeated at constant temperature. Moreover, if the relationship of temperature to the other variables was known, you could calculate how the temperature affected the original measurements and remove that source of error. Hence, the original error could be assigned to a specific cause.

I suspect that one particularly common, but sometimes difficult to detect, source of error is the blunder: a failure to follow a protocol or to properly execute a procedure. Graduate students perform much of the research that forms the basis of publications, but graduate students are often inexperienced in procedures and can make mistakes. When graduate students find unexpected or unusual results, experienced supervisors normally question them closely on the procedures they followed, because it is more likely that the graduate student made an error than that a previously published study or existing lab protocol was erroneous.

Importance of determinate errors

In terms of effective criticism of scientific papers, the detection of determinate errors is often more important than statistical issues, because determinate errors are often much larger than those arising from random fluctuations. A number of instances where systematic error must have occurred are listed in a monograph published by the National Academy of Sciences on the need for critically evaluated physical and chemical data. For example, two independent measurements of the value of a property of atomic nuclei differed by 25%, ten times the uncertainty estimated by those making the measurements.10

Indeterminate or random errors

Even when all known relevant variables are controlled and the method of measurement is the same, it is generally found that the values obtained for similar samples vary. These variations are the result of a number of uncontrolled variables, each of whose effects is individually small; such variations are called random errors. They are dealt with by statistical methods, which provide a means for estimating the variability of results and minimizing the chance of making false conclusions about them. Statistical techniques are generally used only after a reasonable effort to reduce determinate error has been made. Ordinary statistical procedures are not applicable if the errors are not random.11

The normal law of error

The usual analysis of data in experimental biology assumes that the indeterminate errors are distributed according to the normal law of error. This "law" of error has been the subject of some confusion as to whether it is an empirically derived rule or a theoretical construct. Both views are supported. It has been found experimentally that the normal or Gaussian distribution adequately describes systems in which the measurements under study are affected by a very large number of errors all acting independently. However, as stated by Wilson,11 a theoretical argument for the normal law also exists, based on the rapid approach to normality that can be demonstrated mathematically to occur when the error is due to the sum of a number of independent causes, each cause being distributed in any arbitrary manner but having a finite standard deviation. It appears safe to use the normal law for observations in which it is clear that four or five or more sources of error enter with about equal weight.

Random error affects measurement of a variable across all the members of a sample and thus increases the variability of the distribution around the average for the sample.12 Particularly in engineering and the physical sciences, random error is sometimes described as the "noise" in the system that interferes to a greater or lesser extent with the meaningful information found in the "signal." For some measurements, one source of error may be much greater than the others. In such instances, the usual approach is to attempt to control the major source of error, either experimentally (for example, by changing the conditions of measurement) or statistically (by covariance analysis).

For some data (such as the distribution of incomes) the Gaussian distribution does not apply. (In the case of incomes, Pareto's law applies; there is a very long tail on the right-hand side because of the existence of such persons as oil sheiks and oral surgeons.) In such instances, special statistical procedures might be applied. Incidentally, even in such nonnormally distributed populations there is a way of getting a precise estimate of the mean. According to the central limit theorem (one of the most important theorems in statistics), if a population has a finite variance σ² and mean µ, then the distribution of the sample mean m approaches the normal distribution with variance σ²/n and mean µ as the sample size n increases, regardless of the distribution of the parent population. The normal distribution, as well as the construction of confidence limits for means, is discussed in chapter 11.

These concepts can be demonstrated by a physical analogue: the Galton board (named after the pioneer statistician Sir Francis Galton), also known as a quincunx or bean machine (Fig 13-3). It can be constructed by hammering equally spaced and interleaved rows of nails on a board and dropping metal balls, such as BB shot, through a funnel into the array of nails and then into a series of bins that collect the balls as they leave the array. As a ball hits each nail, it has an equal probability of deflecting to either the right or the left. The number of balls that accumulates in the bins will resemble a normal distribution, where the number of rows of nails corresponds to the number of trials and P = 0.5. If there are 15 rows of nails, the location of each ball is the sum of 15 random variables (ie, the randomly determined direction of deflection in each of the rows). A consequence of the central limit theorem is that the sum of n random variables approximates a normal distribution when n is large, and with 15 rows of nails one does get a distribution of balls in the bins that resembles a normal distribution. At www.youtube.com/watch?v=UCmPmkHqHXk there is a graphic demonstration of the Galton board for 3,000 balls; you can watch this demonstration to convince yourself empirically of the truth of the central limit theorem.

Fig 13-3 Galton board. A series of binomial events (ball deflecting either left or right with 50% probability for either direction at each level) will yield a symmetric curve approximating a normal or Gaussian distribution. The paths of two balls are traced, one of which was deflected to the left at every level, and another that had the more likely outcome of some deflections to the left and some to the right.
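The Galton board is also easy to simulate. The following sketch is an added illustration (the code is not from the book; the 15 rows and 3,000 balls echo the demonstration described above): each ball's bin is the sum of 15 random left/right deflections, and the tallies pile up into a roughly normal shape.

# A minimal Galton board simulation: each ball is the sum of 15
# independent left/right deflections, so the bin counts approximate
# a binomial (and hence near-normal) distribution.
import random
from collections import Counter

ROWS, BALLS = 15, 3000
bins = Counter(
    sum(random.choice((0, 1)) for _ in range(ROWS))  # rightward deflections
    for _ in range(BALLS)
)

for k in range(ROWS + 1):
    print(f"bin {k:2d}: {'#' * (bins[k] // 10)}")  # crude text histogram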

Method of least squares

Given that measurements are often distributed according to the normal or Gaussian distribution, the question arises as to the best means for determining a representative value for a set of observations and the associated error. According to Mellor,13 the Belgian army once had a regulation in which the ability of riflemen was evaluated by adding up the distance, regardless of direction, of each man's shots from the center of the target. The man who had the smallest sum won "le grand prix" of the regiment. However, this rule is clearly faulty: if shooter A scored a 1 and a 7 and shooter B scored two 4s, the Belgian army's rule would rank them equal, though most people would believe that the second shooter was the better marksman. Mellor points out that the reason one thinks the second shooter is more accurate is that his shots fall into a lesser area (Fig 13-4); that is, the marksman's error is related not to the length of the straight line from the center of the target to where the bullet hits, but to the area of the circle described about the center of the target with that line as the radius. Intuitively, because the area of the target is proportional to the square of the radius from the center, it seems that one might want to use some measure involving the sum of the squares of the distances of the shots from the center. That is, in fact, the case. Legendre's method of least squares states that "the most probable value for observed quantities is that for which the sum of the squares of the individual errors is a minimum." Gauss himself came to the same conclusion, possibly even earlier than Legendre, but he only published his work in 1809.14 In modern terminology, the sum of squares of deviations is called the residual sum of squares or error sum of squares.

The method of least squares has wide application. It is used, for example, in calculating the best estimate for values of physical constants determined by different methods, and for determining the best fit of experimental data to a straight line (regression), where the best values for the slope and the intercept are calculated.14 The mathematics can be quite complex; the general objective is to obtain a formula for a parameter, here called pe, that provides the best estimate, ie, produces a minimal least squares difference. To do this, one first produces an equation specifying the differences between the observed values and the (at this point undetermined) parameter. The resulting function is then differentiated with respect to pe. The laws of calculus tell us that the sum of squares is a minimum at the value of pe where the derived function is zero. Equating the derived expression to zero enables one to solve for, ie, produce a formula for, pe. Applying the method of least squares, it can be shown that the best value to represent a number of observations of equal weight is their arithmetical mean.13–15

Fig 13-4 The method of least squares. Two marksmen, A and B, have the same average deviation of the distance from where their bullets hit to the center of the target (4 cm). Most people, though, would think that marksman B was the better rifleman because his shots are scattered over a lesser area of the target. By the laws of geometry, the area of the target is related to the square of the radius from the center.
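This last claim can be checked numerically. In the sketch below (my own example, with invented observations), scanning candidate values of pe shows that the residual sum of squares is smallest at the arithmetic mean.

# Minimal check that the least squares estimate of equally weighted
# observations is their arithmetic mean.
observations = [6.8, 7.0, 7.1, 7.3, 6.9]

def residual_sum_of_squares(pe):
    return sum((x - pe) ** 2 for x in observations)

# Scan candidate estimates in 0.001 steps and keep the best one.
candidates = [6.5 + i / 1000 for i in range(1001)]
best = min(candidates, key=residual_sum_of_squares)

print(f"best estimate   = {best:.3f}")
print(f"arithmetic mean = {sum(observations) / len(observations):.3f}")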

Confidence intervals

CIs indicate the precision of an estimate of a population value derived from a sample: the wider the CI, the less the precision. CIs are now expected in a number of medical journals, but formulae for the calculations are not readily available in standard statistical textbooks. This book includes appendix 2, which gives binomial population limits expressed as percentages, and appendix 3, which contains critical values of the t distribution that can be used to construct CIs as outlined in chapter 11. However, CIs for a number of other important statistics are not covered in this book, such as regression coefficients, correlation coefficients, relative risks, odds ratios, survival data, and nonparametric analyses (eg, medians). Fortunately, there is a specialized and accessible text16 that provides the necessary information.
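For the common case of the mean of a small sample, a CI can be constructed from critical values of the t distribution, as outlined in chapter 11. The sketch below is an added illustration with invented probing-depth values; it assumes the scipy library is available (an assumption on my part, not a requirement of the book).

# 95% confidence interval for a mean using t-distribution critical
# values (as tabulated in appendix 3).
from statistics import mean, stdev
from scipy.stats import t

probing_depths = [3.1, 2.8, 3.4, 3.0, 3.3, 2.9]  # hypothetical mm values
n = len(probing_depths)
m, s = mean(probing_depths), stdev(probing_depths)

t_crit = t.ppf(0.975, df=n - 1)        # two-sided 95% critical value
half_width = t_crit * s / n ** 0.5     # t times the standard error
print(f"mean = {m:.2f} mm, 95% CI = ({m - half_width:.2f}, {m + half_width:.2f})")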

Error of combined measurements

A common problem is that of estimating the error correctly when independent measurements are combined. This consideration becomes particularly important in evaluating derived measurements when the variables are related by other than simple relationships. It can be shown that when the derived quantity u is some function of j independent variables, ie, u = f(v1, v2, . . . , vj), then:

su² = ∑(i=1 to j) (∂f/∂vi)² svi²

where su² is the variance of u, and svi² is the variance of variable vi. This relationship is called the law of propagation of errors.2 The law can be reliably used when the errors are reasonably small (10% or less) with respect to the measured values. Mandel14 states that the law is exact for linear combinations of random variables but only approximate for nonlinear combinations.
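Applied numerically, the law requires only the partial derivatives and the component variances. The sketch below is an added illustration (the function and the variances are invented); it estimates the derivatives by finite differences and checks the propagated SD against a brute-force simulation.

# Propagation of errors: variance of u = f(v1, v2) from the variances
# of v1 and v2, using numerically estimated partial derivatives,
# checked against a Monte Carlo simulation.
import random

def f(v1, v2):
    return v1 / v2          # a derived quantity, eg, a ratio

v1, v2 = 10.0, 4.0          # measured values
s1, s2 = 0.5, 0.2           # their standard deviations

h = 1e-6
df_dv1 = (f(v1 + h, v2) - f(v1 - h, v2)) / (2 * h)
df_dv2 = (f(v1, v2 + h) - f(v1, v2 - h)) / (2 * h)
var_u = (df_dv1 * s1) ** 2 + (df_dv2 * s2) ** 2

sims = [f(random.gauss(v1, s1), random.gauss(v2, s2)) for _ in range(100_000)]
m = sum(sims) / len(sims)
var_mc = sum((x - m) ** 2 for x in sims) / (len(sims) - 1)

print(f"propagated SD = {var_u ** 0.5:.4f}, simulated SD = {var_mc ** 0.5:.4f}")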

A simple case: Sums or differences

Consider an experiment where the quantity of interest is simply the difference between two measurements. This type of calculation occurs in periodontics when comparing probing depths before and after treatment:

u = f(v1, v2) = v1 − v2

The propagation of errors theorem gives us:

su² = sv1² + sv2²

which, in terms of the SD of the difference between the measurements, becomes:

su = √(sv1² + sv2²)

Hence, in this simple case, because the squares of the SDs of the measurements of each variable are combined, a large error in one component tends to overshadow small errors in the other component. Note that the SD of the difference (or a sum) is less than the sum of the SDs of the components.
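For instance (with invented numbers), if the pre- and posttreatment probing depths each carry an SD of 0.8 mm, the SD of the change is about 1.13 mm, not 1.6 mm:

# SD of a difference of two independent measurements: it grows as the
# root of the sum of squares, not as the sum.
s_before, s_after = 0.8, 0.8  # SDs of the two probing-depth measurements (mm)
s_change = (s_before ** 2 + s_after ** 2) ** 0.5
print(f"SD of the change = {s_change:.2f} mm")  # about 1.13 mm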

Ratios

Ratios are commonly used in biologic science; for example, enzyme activity is often expressed as a specific activity (eg, units per milligram of protein), and tooth support may be expressed as the ratio of the length of tooth root adjacent to alveolar bone to total root length. The function for a ratio may be described as:

u = v1/v2

Applying the propagation of errors theorem, and noting that ∂u/∂v1 = 1/v2 and ∂u/∂v2 = −v1/v2², we obtain:

su² = (1/v2²)sv1² + (v1²/v2⁴)sv2²

Dividing by u² = (v1/v2)² gives:

(su/u)² = (sv1/v1)² + (sv2/v2)²

Because the coefficient of variation (CV) is su/u for u, sv1/v1 for v1, and sv2/v2 for v2, the relationship becomes:

(CVu)² = (CVv1)² + (CVv2)²

Thus, for ratios, the relative errors compound: the square of the relative (fractional or percentage) error of the function equals the sum of the squares of the relative errors of the component functions.
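A worked example (my own invented figures): if enzyme units are measured with a 5% CV and protein with a 4% CV, the specific activity carries a combined CV of about 6.4%.

# Relative errors compound in quadrature for a ratio such as
# specific activity = enzyme units / mg protein.
cv_units, cv_protein = 0.05, 0.04
cv_ratio = (cv_units ** 2 + cv_protein ** 2) ** 0.5
print(f"CV of specific activity = {cv_ratio:.3f}")  # about 0.064, ie, 6.4%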

Standardization and errors in calibration

Studies in clinical research sometimes require assessment of exposure to a chemical and/or determination of a biomarker that may predict or influence disease. The amounts of exposure or the concentration of a biomarker may vary widely, and there will be a need to devise an assay that can determine concentrations over a range of values. A common practice for the calibration of an assay to determine the concentration of biochemicals, for example, is to produce a standard or calibration curve. In this procedure, known amounts of the biochemical are assessed, typically by reaction with some chemicals or substrate, and then measured by a physical technique such as fluorimetry or optical densitometry at a specific wavelength. The results are then plotted with optical density, for example, on the ordinate and concentration on the abscissa. Then a straight line that provides the best fit to the points is drawn, either by eye or more rigorously calculated using regression and the method of least squares; this line is known as the standard curve. Unknowns, such as samples of serum from a human population, are then measured in the same way, and their optical density (or fluorescence or other signal) is recorded. The measured value for the sample is then located on the ordinate of the standard curve, and the concentration of the unknown is determined by the corresponding value of the abscissa. Although this straightforward procedure is performed regularly in laboratories, a problem arises when one wants (or is required, as in, for example, legal disputes where exposure to a chemical might be of paramount consideration) to determine the errors. There are really two types of errors involved: those that relate to the measurement of the sample itself and those that result from the regression used to produce the standard curve. Typically, lab workers ignore the error due to regression, but rigorous assessment requires that the error associated with regression be calculated; this complex matter, discussed by van Belle17 and Mandel,14 is beyond the scope of this book.
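The routine part of the procedure is easily sketched in code. The example below is an added illustration with invented standards and readings, and it assumes the numpy library is available; as is typical laboratory practice, it ignores the regression error just discussed.

# Fit a standard curve by least squares and read an unknown off it.
import numpy as np

concentrations = np.array([0.0, 2.0, 4.0, 6.0, 8.0])        # known standards
optical_density = np.array([0.02, 0.21, 0.39, 0.61, 0.80])  # assay readings

slope, intercept = np.polyfit(concentrations, optical_density, 1)

unknown_od = 0.47                                  # reading for a serum sample
estimated_conc = (unknown_od - intercept) / slope  # invert the standard curve
print(f"estimated concentration = {estimated_conc:.2f}")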

Analysis of errors by analysis of variance

Measurements in clinical research, as well as basic research using animals, are usually subject to three major sources of variation:

1. Variation between individuals. This variation, sometimes called true biologic variation, is caused by all the factors that make individuals differ, such as age, sex, and genetic factors.
2. Variations with time. Some properties of individuals can vary from hour to hour or from one day to the next, and a field of study, chronobiology, is devoted to understanding such variations.
3. Measurement error. This source of variation results from all the factors that tend to produce differences when measuring the same phenomena. Examples include fluctuation in line voltages affecting electrical equipment, different technicians performing the same assay, and stability of reagents. In clinical studies, examiner error is often considered separately from other measurement errors.

Analysis of errors may be described in different ways, but an especially informative reporting strategy is to present an analysis of variance (ANOVA) table. For example, in papers analyzing measurement techniques in clinical dentistry, the sources of variance that are often of interest are patients, examiner effects, and errors of measurement. Suppose there were a study incorporating 20 patients and two examiners who made one measurement each on each patient. An ANOVA table might look like Table 13-1, and the components of variance would be estimated as outlined by Fleiss and Kingman.18

Table 13-1 | Sample ANOVA table

Source of variation | Degrees of freedom | Sum of squares | Mean square
Between patients    | 19                 | 190            | 10
Between examiners   | 1                  | 7              | 7
Random error        | 19                 | 1.9            | 0.1
Total               | 39                 | 198.9          |

For patients, the variance:

sp² = (mean square patients − mean square error) / no. of examiners = (10 − 0.1) / 2 = 4.95

For examiners, the variance:

sx² = (mean square examiners − mean square error) / no. of measurements = (7 − 0.1) / 40 = 0.17

For error, the variance:

sE² = mean square error = 0.1


This information on variance can be used to estimate the reliability of the measurements, as well as to plan experiments. A quantity that summarizes the relative magnitudes of the estimated components of variance is the intraclass correlation coefficient R:

R = sp² / (sp² + sx² + sE²)

Values of the intraclass correlation coefficient close to one indicate excellent reliability (for this to happen, sx² + sE² must be small relative to sp²); conversely, values close to zero indicate poor reliability, as the variance attributed to examiners and to random measurement error is then relatively large. The purpose of a reliability study is not to test hypotheses but rather to assess the characteristics of the measurements and the relationships between them. Such studies enable an assessment of the reliability of the data, not a determination of whether differences exist between groups. In some situations, the value of component variances cannot be assessed directly but can be deduced through appropriate calculation and experimental design. This topic, variance components analysis, is discussed by Box et al.19

Moreover, by looking at the various components of variance, we can evaluate and improve experimental design. In this case, it is clear that the major source of variation is between the patients. Thus, to use this system of measurement to look for the effects of a treatment, it would not help the investigator much to refine the calibration of the examiners, who are not adding significantly to the error. It might be useful, however, to try to select patients from a more homogenous group so that the variance attributed to patients could be reduced.
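The arithmetic of this section is compact enough to script. The sketch below is an added illustration that reproduces the Table 13-1 worked numbers (the divisor of 40 follows the calculation printed above).

# Variance components and intraclass correlation for the Table 13-1
# example (20 patients, 2 examiners, 1 measurement each).
ms_patients, ms_examiners, ms_error = 10.0, 7.0, 0.1
n_patients, n_examiners = 20, 2

var_patients = (ms_patients - ms_error) / n_examiners                    # 4.95
var_examiners = (ms_examiners - ms_error) / (n_patients * n_examiners)   # 0.17
var_error = ms_error                                                     # 0.1

R = var_patients / (var_patients + var_examiners + var_error)
print(f"R = {R:.2f}")  # about 0.95: patient differences dominate the variance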

References

1. Poincaré H. Thermodynamique. Paris: G. Carré, 1892. Quoted by: Mellor JW. Higher Mathematics for Students of Chemistry and Physics. London: Longmans, Green, 1909:515.
2. Leaver RH, Thomas TR. Analysis and Presentation of Experimental Results. London: Macmillan, 1974:5.
3. Winter AA. Measurement of the millimeter markings of periodontal probes. J Periodontol 1979;50:483.
4. Shaw L, Murray JJ. Diagnostic reproducibility of periodontal indices. J Periodontal Res 1977;12:141.
5. Hillman H. Certainty and Uncertainty in Biochemical Techniques. London: Surrey University, 1972:ix.
6. Hillman H. Certainty and Uncertainty in Biochemical Techniques. London: Surrey University, 1972:114.
7. Kingman A, Morrison E, Löe H, Smith J. Systematic errors in estimating prevalence and severity of periodontal disease. J Periodontol 1988;59:707.
8. Wilson EB. An Introduction to Scientific Research. New York: McGraw-Hill, 1952:233.
9. Mainland D. Elementary Medical Statistics. Philadelphia: Saunders, 1963:155–156.
10. National Academy of Sciences. National Needs for Critically Evaluated Physical and Chemical Data. Washington DC: National Academy of Sciences, 1978.
11. Wilson EB. An Introduction to Scientific Research. New York: McGraw-Hill, 1952:246.
12. Trochim WMK. The Research Methods Knowledge Base, ed 2. Cincinnati: Atomic Dog, 2001:90.
13. Mellor JW. Higher Mathematics for Students of Chemistry and Physics. London: Longmans, Green, 1912:498–566.
14. Mandel J. The Statistical Analysis of Experimental Data. New York: Dover, 1964:131–159.
15. Wilson EB. An Introduction to Scientific Research. New York: McGraw-Hill, 1952:226–229.
16. Gardner MJ, Altman DG, Machin D, Bryant T. Statistics with Confidence: Confidence Intervals and Statistical Guidelines, ed 2. Hoboken: Wiley, 2013.
17. Van Belle G. Statistical Rules of Thumb. New York: Wiley, 2002:12–127.
18. Fleiss JL, Kingman A. Statistical management of data in clinical research. Crit Rev Oral Biol Med 1990;1:55.
19. Box GEP, Hunter WG, Hunter JS. Statistics for Experimenters: Design, Innovation, and Discovery. New York: John Wiley & Sons, 1978:571–582.


14 Presentation of Results

Ideals and Objectives

As noted previously, a scientific paper is an attempt to persuade the reader of the truth of the author's conclusions. A main component of persuasion is the presentation of evidence. Tufte,1 perhaps the most influential figure in contemporary graphic design,2 has argued that the point of displaying evidence is to assist the thinking of both the producer and the consumer of the information. The common experience of practicing scientists suggests that Tufte's concept is true. Lab meetings or student supervisory committee meetings almost always involve some sort of pictorial display of the evidence or the concepts; sometimes these are outlined on blackboards, whiteboards, or paper, and sometimes they are projected using digital displays and sophisticated computer programs. It seems that scientists cannot engage in discussion without using images and diagrams, which are inherently central to scientific reasoning. But the effectiveness of scientific images varies considerably. Tufte has suggested that the quality of evidence can be judged on three main criteria: integrity, quality, and relevance.3 Another standard of quality is more subjective: beauty.

"A colleague of Galileo, Federico Cesi, wrote that Galileo's 38 hand-drawn images of sunspots 'delight both by the wonder of the spectacle and the accuracy of expression.' That is beautiful evidence."
— Edward Tufte1

Integrity

The amount of faith that readers of scientific articles place in the integrity of the authors is striking. Few readers will be able to repeat the experiments, and, consequently, most readers are forced to accept as true the raw evidence displayed in a paper. Readers must assume that the fields of view presented in micrographs are representative or typical, and are not special occurrences. They also must assume that the data have been obtained in the manner described in the materials and methods section—and that the data have not been selected or excluded so as to conform to the hypotheses of the authors.


Fig 14-1 (a) Clinical situation showing three teeth with caries. (b) "Photoshop restoration," in which a lesion has been restored using digital Photoshop manipulations. (c) Definitive restoration with three Photoshop-restored teeth in situ.

The advent of digital photography has brought the issue of integrity to the forefront, as authors now have considerable ability to generate a virtual reality that can mislead readers. The manipulation of photographic images has a long and not always honorable history. In a lecture to the Royal Society in 1865, a pioneer in the photography of mental patients claimed that photography “makes them observable not only now but forever, and it presents also a perfect and faithful record.”4 The limitations of photography in providing perfectly faithful records soon became apparent, for it was discovered that both the visual habits of the photographers and available practical techniques influenced the images.4 Darwin illustrated his study of the appearance of human emotions with photographs. However, rather than presenting raw data of people experiencing emotions, the photographs were staged with actors, in effect, giving their impressions of the emotions.5 Dental records, in particular, have been subject to manipulation. One famous case concerned a University of British Columbia (UBC) professor on sabbatical in Switzerland, whose wife went missing. Police located a body that had been cut up, placed in green garbage bags, and thrown into a ravine. In an attempt to identify the body, the police asked the professor for his wife’s dental records. When they looked closely at the records he provided, police discovered that they had been crudely altered. The professor explained that he had altered the records so that he would not have to face the possibility that his wife was actually dead. Unconvinced by his explanation, the police brought the case to trial, where the professor benefited from the Swiss system that—besides the traditional categories of guilty and not guilty—has an intermediate result: not guilty by reason of

doubt.6 The professor walked away a free man, but it is rumored that UBC students were less convinced of his innocence, and referred to the professor as “the man from Glad” (referring to an actor who advertised green garbage bags). Regardless of his guilt or innocence, by altering the image, the professor cast doubt on his integrity and caused himself much trouble. The black magic marker has proven to be a versatile instrument for scientists wishing to alter reality, reportedly being used for such diverse purposes as adding bands to images of polyacrylamide gels and to spot-painting mice to mislead observers into thinking that transplants of black mouse skin to white mice had been successful.7 Modern digital techniques improve on the traditional magic marker approach and can be employed to modify images in a manner that can be impossible to detect. Figure 14-1 shows a virtual restoration of a decayed tooth that my colleague Dr Babak Chehroudi completed using Photoshop. A fraudulent dentist could use such techniques to fool insurance companies into paying for procedures that were never done. Similarly, in science, an investigator could generate images to represent phenomena that did not take place, or modify images to provide support for a central argument. Some journals request access to the raw camera-format images to preclude the possibility that the original observations have been altered. More subtle compromises of integrity can occur using the tools of rhetoric outlined in chapter 4. The selection and manipulation of data can be guided by the rhetorical principle of cognitive response to direct readers to certain conclusions. This direction can be done stealthily by such methods as choosing indices and scales of graphs to magnify effects (Cialdini’s contrast


The quality of data and illustrations presented in scientific papers can be difficult to assess and can involve technical considerations, numeric considerations, and close examination of results.

states, “Beautiful Evidence is about how seeing turns into showing, how empirical observations turn into explanations and evidence.” Sir John Herschel (1799–1871) was an acclaimed scientist in several fields, including scientific photography. Schaaf summarizes Herschel’s concept of beauty: “for him [Herschel] the real beauty of photography lay less in science facilitating the making of images, but in the capacity to reveal the truths behind science.”11 Inherently, the ideals of absolute objective verisimilitude and informative clarity can conflict, a contest that tends to be resolved according to the purpose of the illustration. In botanical illustration, for example, any one plant may not show all the features of interest in a single illustration. Similarly, in producing images for practical purposes such as distinguishing between poisonous and harmless mushrooms, it might be essential to show both the cap’s surface and underside, which might be impossible under normal lighting conditions. In artistic illustrations, a series of conventions were adopted such as showing animals in their natural environments (even when the animals themselves were pure fantasy—the Thomas Fischer Rare Book Library at the University of Toronto has an illustration of an enormous humansized lobster attacking a man).12 Other conventions included the illustration of both the total organism and its details, transparency (showing the exterior and the interior of the subject simultaneously), showing time series of development, and depiction of the same subject from multiple points of view.12 Some of these conventions of artistic illustrations were carried over into modern photographic documentation, and some new ones enabled by technical advances, such as staining by multiple specific antibodies on a single specimen to show relationships in molecular distributions and three-dimensional (3D) reconstructions of optical sections obtained by confocal microscopy, were also introduced.

Beauty

Technical quality

Beauty may seem like a strange subjective quality to look for in allegedly objective scientific illustration, but it is important for both the demonstration and impact of findings. In Beauty in Photography, the photographer Robert Adams states, "The point of art has never been to make something synonymous with life but to make something of reduced complexity that is nonetheless analogous to life and can therefore clarify it."10 And that principle is behind Edward Tufte's Beautiful Evidence, in the introduction of which Tufte1 states, "Beautiful Evidence is about how seeing turns into showing, how empirical observations turn into explanations and evidence." Sir John Herschel (1799–1871) was an acclaimed scientist in several fields, including scientific photography. Schaaf summarizes Herschel's concept of beauty: "for him [Herschel] the real beauty of photography lay less in science facilitating the making of images, but in the capacity to reveal the truths behind science."11

Inherently, the ideals of absolute objective verisimilitude and informative clarity can conflict, a contest that tends to be resolved according to the purpose of the illustration. In botanical illustration, for example, any one plant may not show all the features of interest in a single illustration. Similarly, in producing images for practical purposes such as distinguishing between poisonous and harmless mushrooms, it might be essential to show both the cap's surface and underside, which might be impossible under normal lighting conditions. In artistic illustrations, a series of conventions were adopted, such as showing animals in their natural environments (even when the animals themselves were pure fantasy—the Thomas Fisher Rare Book Library at the University of Toronto has an illustration of an enormous human-sized lobster attacking a man).12 Other conventions included the illustration of both the total organism and its details, transparency (showing the exterior and the interior of the subject simultaneously), showing time series of development, and depiction of the same subject from multiple points of view.12 Some of these conventions of artistic illustrations were carried over into modern photographic documentation, and some new ones enabled by technical advances, such as staining by multiple specific antibodies on a single specimen to show relationships in molecular distributions and three-dimensional (3D) reconstructions of optical sections obtained by confocal microscopy, were also introduced.

principle), or placing arrows or other indicators to highlight some features present in particular micrographs, while obscuring other features. The choice of measurement used by the investigator can pose another concern for integrity. For example, in the comic movie Borat, the provocateur/protagonist tells members of the Veteran Feminists of America that women cannot benefit from education, because their brains are small, like squirrels’ brains. The group protests the conclusion, but, in fact, women’s brains are smaller than men’s. However, this difference disappears when correction is made for body size. That is, large men and large women have large brains, and small women and small men have small brains; the difference between the sexes is accounted for by the generally greater size of men relative to women. But the measure of brain size as an indicator of intelligence is inherently flawed and unreliable, and relates back to the discredited pseudoscience of phrenology. For example, Einstein’s brain was rather ordinary in size. The 19th-century anatomist Paul Broca, who championed the idea of the relationship of brain size to intelligence, probably would have been saddened by the finding— discovered after his death—that his own brain (at 1,424 g) was not particularly large.8 The problem with the theory that brain size influences intelligence is that the organization of the brain, rather than its size, determines cognitive ability.9 Thus, any study using brain size as a surrogate for intelligence would lack integrity, because the concept has been shown to be false.

Quality

175

Brunette-CH14.indd 175

10/21/19 11:15 AM

PRESENTATION OF RESULTS

the result. In any case, one’s work being chosen for the cover can cause some puffing out of scientific chests, often accompanied by the appearance on lab walls of framed reproductions of the scientific masterpiece.

Relevance

Relevance concerns the desirability of having information in context. Two major elements are comparison and mapping.

information is required to put the salary number into context. The two major types of comparisons made in scientific papers are: (1) comparisons between different sets of observations, and (2) comparisons between a set of observations and expectations. Specific tests are available for assessing such comparisons. For example, probably the most common statistical test for demonstrating the statistical significance of differences between sets of observations is the t test, as might be used when comparing the value of some outcome variable between a treated group and a control group. A goodness of fit test, such as the χ2 test, would be commonly used to compare observed versus expected frequencies for categorical data (see chapter 10). In a scientific paper, the author decides which comparisons will be featured, and there is a temptation to choose those that make the study look important. For many clinical studies of treatments, there are at least two positive possibilities: (1) that the treatment improved some outcome variable with time (eg, the effect of flossing over time on the amount of interproximal plaque or gingivitis), and (2) that one treatment was superior to another (eg, dental flossing might lower interproximal plaque values relative to no treatment or treatment with a mouthrinse). In a study’s original design, the authors might have hoped that the new treatment was better than the currently accepted treatment, but if they failed to demonstrate statistical superiority of the new treatment (ie, if the statistical test resulted in accepting the null hypothesis of no difference between treatments), they might resort to the lesser claim that the treatment at least worked to some degree, based on a comparison of the values before and after treatment.

Comparison

Mapping

There is nothing so lonely as an isolated statistic. If the only fact about my income in my final year of employment was reported as $182,000 (Canadian) per year (UBC professors, being public servants, have their salaries published), the reader would immediately think of different comparisons to put the salary information in context. What is the average salary of someone living in Vancouver? What is the purchasing power of that salary in American dollars? What does an average UBC professor make? What does the average burnt-out professor make? The exact question that the reader might ask would be related to a specific purpose. No matter whether it be to establish that I earned enough to be able to eat or determine my status (as determined by salary) among other professors at UBC, additional

Mapping

Tufte14 has stated that scientific images should nearly always be mapped, contextualized, and placed on a universal grid. In mapped pictures, representational images are combined with scales, diagrams, overlays, numbers, and words. Such mapped images can facilitate comparisons and enable explanations. In essence, a mapped picture provides the detailed, specific, and perhaps unique information present in an image (eg, a photomicrograph) in combination with the abstract, focusing, and explanatory power of a diagram. A scale bar gives a universal standard of comparison so that readers can appreciate the actual dimensions involved. An advantage of mapping is that the relevant information is presented within one visual field. Some journals insist on scale bars for micrographs, whereas

Numeric quality Numeric data also incorporate indicators of quality, such as reporting variation (eg, small standard deviations [SDs] or error bars and the like), resolution (eg, the number, spacing, and distinctiveness of protein bands on a polyacrylamide gel), and accuracy, which can be demonstrated through incorporation of standards (eg, molecular weight standards for polyacrylamide gels separating molecules on the basis of size).

Consistency Often, different tables or figures contain values that are generated under identical conditions. For example, the value obtained for an untreated control sample might appear in several graphs that report the results of different treatments. Ideally, the values from repeated observations should be close to each other; contrarily, a lack of repeatability could cause a reader to suspect that some factors in the experimental methods were not well controlled.

Relevance

176

Brunette-CH14.indd 176

10/21/19 11:15 AM

Ideals and Objectives

Fig 14-2 Dynamic interactions between cells and surfaces. [Schematic from the grant application: surface physical cues (topography, eg, grooves; chemistry, eg, Ti; adsorbed proteins and growth factors) feed into altered cell shape and cytoskeleton, cell signaling, motility (eg, contact guidance), proliferation, ECM secretion and turnover, differential gene expression, and cytokine/chemokine secretion, with feedback through the modified microenvironment and cell-cell interactions; the panels are keyed to the three aims of the grant.]

others allow the magnification to be stated in the figure legend. Having the scale bar directly on the micrograph simplifies comparisons for the reader, who otherwise would have to measure features in the micrograph and divide by the magnification to arrive at the size of a given structure.

An example of using mapping as an organizational aid is shown in Fig 14-2, taken from one of my grant applications, which was funded and rated the best of those evaluated by the Biomedical Engineering Committee at the Canadian Institutes of Health Research in that competition. In brief, reviewers of grants face a complex problem: typically, the grants have several aims and use multiple techniques. Immersed in this mass of detail, reviewers, in my view, require help in seeing the "big picture." This figure outlines the processes involved as a cell interacts with an implant surface. Moreover, it shows the particular processes being studied in the three aims of the grant. It will be recalled that feasibility is one of the main evaluative criteria used by grants committees, and they want to be sure that the applicant can actually perform the types of experiments proposed. Thus, although the resolution of the pictures in the figure is not optimal, it is sufficient to show that the lab had performed experiments using various techniques, such as gene expression, cytokine measurement, electron microscopy, and immunostaining, that were proposed in the application. Moreover, at a glance the reviewer can determine which techniques were being used to pursue which aim. In a way, this figure resembles a tourist map that has the main districts and streets labeled but also includes small pictures of the buildings or historic sites in each district.

Another example of mapping is taken from a talk I gave on my lab's research in Gothenburg (Fig 14-3). In this instance, the map is a metaphor for the various lines of investigation. The lines are named for the collaborators involved, and the stations are named after the students or postdocs involved; therefore, it reminded me to name the proper persons in the proper places. In addition, like a real map, it noted the districts of research approach through which the lines passed.

Fig 14-3 Strategy map.

The Selection and Manipulation of Data

Because neither the exact length nor the form of a scientific paper is always strictly specified, an author is relatively free to decide (within some constraints that vary among journals) how much and what data to include, as well as how the data are to be presented. In making these choices, authors have an opportunity to make their case as convincing as possible. This section examines some of the strategies and standards for reporting and presenting data. In reporting lengths, an author is usually constrained to employing the international standard: the meter. But in other situations, there may be more than one way of looking at the data; they may be presented as raw data (just the way they were recorded), as averages, as a ratio of some measured value relative to some standard, or as a ratio relative to some point of time or control. Although there may be many legitimate ways of looking at the same set of data, authors will normally choose the way that makes the data look most convincing. In fact, authors can show extraordinary ingenuity in presenting their data in the best light.

To illustrate several of the ways of looking at data, consider the following hypothetical set of data from an experiment investigating the effect of a mouthrinse on oral malodor. The chemical measurement made is the amount of volatile sulfur compounds (VSC) in nanograms per milliliter of mouth air. In this hypothetical experiment, the mouth air is analyzed for two individuals (A and B) prior to treatment and at 1, 2, and 3 hours afterward. The results are compared with a control treatment, which comprises rinsing with distilled water. The data could be processed in several ways, including the following:

1. The absolute values of VSC (Table 14-1). If replicated, the data could then be analyzed by analysis of variance (ANOVA) or other sophisticated statistical tests for significance.


Table 14-1 | Total VSC concentrations (ng/mL)

Individual | Treatment | Baseline | 1 h | 2 h | 3 h
A | Water | 10 | 8 | 9 | 10
A | Mouthrinse | 10 | 2 | 3 | 5
B | Water | 4 | 3 | 4 | 4
B | Mouthrinse | 4 | 0.8 | 1.2 | 2

Table 14-2 | Decrease of VSC (ng/mL) from baseline with time

Individual | Treatment | 1 h | 2 h | 3 h
A | Water | –2 | –1 | 0
A | Mouthrinse | –8 | –7 | –5
B | Water | –1 | 0 | 0
B | Mouthrinse | –3.2 | –2.8 | –2

Table 14-3 | Relative levels of VSC with time

Individual | Treatment | Baseline | 1 h | 2 h | 3 h
A | Water | 100% | 80% | 90% | 100%
A | Mouthrinse | 100% | 20% | 30% | 50%
B | Water | 100% | 75% | 100% | 100%
B | Mouthrinse | 100% | 20% | 30% | 50%

Table 14-4 | Percent reduction of VSC levels with time

Individual | Treatment | 1 h | 2 h | 3 h
A | Water | 20% | 10% | 0%
A | Mouthrinse | 80% | 70% | 50%
B | Water | 25% | 0% | 0%
B | Mouthrinse | 80% | 70% | 50%

Table 14-5 | Ratio of treated to control VSC concentrations

Individual | 1 h | 2 h | 3 h
A | 25% | 33.3% | 50%
B | 26.7% | 30% | 50%

Table 14-6 | Average VSC levels (ng/mL) after treatment

Individual | Mouthrinse | Water
A | 3.33 | 9
B | 1.33 | 3.67

Table 14-7 | Average VSC levels (ng/mL) after treatment, all individuals

Mouthrinse | Water
2.33 | 6.34

2. The decrease from the baseline reading (Table 14-2).
3. The amount of VSC relative to the baseline for each individual (Table 14-3). Note that the difference between individuals has disappeared in this treatment of the data, because the baseline for each treatment is set at 100%, and the relative response with time is similar.
4. The percent reduction from baseline (Table 14-4).
5. Disregarding the baseline, and expressing the values in the mouthrinse-treated group relative to the water control (Table 14-5).
6. Data could be averaged for all times after treatment for each individual (Table 14-6).
7. The data could be averaged for all times and individuals (Table 14-7).
8. As in point 7, only expressed as a percentage. Thus, all the data could be reduced to a single statement: "The VSC concentrations in the mouthrinse-treated individuals were 37% of those where water was used," or, alternatively, "Mouthrinse reduced VSC 63%."


9. The author could choose just one time point for a comparison and write, "One hour after treatment, the mouthrinse decreased VSC 80%."
10. Another approach would be to define as objectionable any person whose breath had a VSC concentration greater than five. Using these data, it could be said that mouthrinse totally eliminated any objectionable odor from 100% of the people tested for at least 3 hours (not an impressive result, as one of the two individuals was not "objectionable" prior to treatment).
11. Suppose that a third individual, C, had been tested, and the mouthrinse did not affect the VSC in C's mouth air. An unscrupulous author might simply ignore the results. A more creative approach would be to designate C a nonresponder, and then bury somewhere in the materials and methods section the statement that only responders were studied. The beauty of this approach is that, by setting the definition of who is a responder at an appropriate level, the author could manipulate the data to produce a more impressive result.

This list is not exhaustive; several other ways of treating the data are available. Any of the above methods of treating the data (save the dishonest ones) might be suitable, depending on the arguments an author might want to make and the relative importance of the particular data set to the author's conclusions.

There are two major points on the author's treatment of data. First, an author selects the data to be presented. The data that appear in scientific papers are condensed from the numbers that fill lab notebooks. Many journals specifically ask referees to point out how a paper could be shortened to reduce the cost of publication. Moreover, because it is not reasonable to expect the author to report every piece of data recorded during a study, an author must walk a tightrope, balancing the merits of an economical description with the benefits of complete data reporting without falling into the abyss of intent to deceive. The reader should be aware that possibly not all collected data are presented. Missing data raise the chance that an author is trying to hide a weakness, such as excessive variability or inconsistencies, or to report only the conditions that gave the biggest effect (as could be done in the example if only the 1-hour time point was considered).

Second, an author determines how data are manipulated. My rule is that the reader's suspicion should be directly proportional to the degree to which the data have been processed. A multitude of problems can be hidden through the ingenious use of relative values, corrections for backgrounds, or choice of scale dimensions and baselines. The closer the presented data are to the actual observations, the better the chance a reader has of interpreting the data independently. Each manipulation involves an assumption about the underlying process. To return to our example, comparing everything to a baseline is valid only if the baseline is stable. Defining what concentration of VSC is "objectionable" is a value-laden process, as is setting the level of responders or nonresponders. Comparing relative values between individuals would be appropriate only if the response were the same regardless of the initial absolute level. Presenting only processed data forces the reader to make the same assumptions as the author.
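To make the preceding list concrete, the sketch below derives several of the tables from the raw values of Table 14-1. It is plain Python with names of my own choosing, not code from any study (the data themselves are the hypothetical values from the text):

```python
# Sketch: deriving Tables 14-2 to 14-7 from the raw VSC values in Table 14-1.
raw = {  # individual -> treatment -> [baseline, 1 h, 2 h, 3 h] VSC (ng/mL)
    "A": {"water": [10, 8, 9, 10], "mouthrinse": [10, 2, 3, 5]},
    "B": {"water": [4, 3, 4, 4], "mouthrinse": [4, 0.8, 1.2, 2]},
}

for person, treatments in raw.items():
    for treatment, values in treatments.items():
        baseline = values[0]
        decrease = [v - baseline for v in values[1:]]                      # Table 14-2
        relative = [100 * v / baseline for v in values]                    # Table 14-3
        reduction = [100 * (baseline - v) / baseline for v in values[1:]]  # Table 14-4
        print(person, treatment, decrease, relative, reduction)

# Table 14-5: treated/control ratio at each time point, disregarding baseline
for person in raw:
    water, rinse = raw[person]["water"], raw[person]["mouthrinse"]
    print(person, [round(100 * r / w, 1) for r, w in zip(rinse[1:], water[1:])])

# Tables 14-6 and 14-7: average over post-treatment times, then over individuals
mean = lambda xs: sum(xs) / len(xs)
per_person = {p: {t: mean(v[1:]) for t, v in d.items()} for p, d in raw.items()}
overall = {t: mean([per_person[p][t] for p in per_person])
           for t in ("mouthrinse", "water")}
print(per_person)  # A: 3.33 vs 9; B: 1.33 vs 3.67 (Table 14-6)
print(overall)     # mouthrinse ~2.33, water ~6.33 (Table 14-7, rounded to 6.34)
```

Each derived table is one line of arithmetic away from the raw data; the reader's task is to notice which line the author chose.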

More on derived measures

A common way of manipulating data is for an author to devise an index or other derived measurement.

Example 1: Use of ratios to suppress variation

The cardiac glycoside ouabain inhibits Na+- and K+-activated ATPase in isolated plasma membranes. However, the amount of the enzyme activity varies from batch to batch in membrane preparations. Because of this large batch-to-batch variation, an investigator might be unable to discern small effects in cells from rodents, which are not very sensitive to the drug. By calculating the ratio below, it is possible to demonstrate statistically significant effects of even very low concentrations of the drug:

% inhibition = (activity in sample with ouabain / control activity [ie, no ouabain]) × 100

For an author, the advantage of this procedure is that the data, freed of troublesome variations, become much cleaner and more convincing. For the reader, the problem with such data is that the reader cannot find out what variation in the original measurements existed, or what caused it. In this example, why did the membrane preparations vary in ATPase activity? The reporting of data relative to a standard value is sometimes informally referred to as normalization. (Note that the term normalization is also used to describe the process whereby data are transformed into a z score, which represents the score’s position on a Gaussian curve.)

Example 2: Use of ratios to mask or enhance changes

A second problem with data reported as a ratio concerns the size of the denominator, which may either mask or enhance the perception of any changes. To enhance an effect, ratios are calculated with denominators that are as small as possible. For example, suppose that 91% of people were employed and 9% unemployed when an administration took office, but that during that term of office, the numbers changed to 95% employed and 5% unemployed. This change could be expressed either as an increase in employment:

(4 / 95) × 100% = 4.2%

or as a decrease in unemployment:

(4 / 9) × 100% = 44%

The decrease in unemployment is much larger in percentage terms on account of the low value of the denominator.
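The arithmetic behind the two headline figures is worth seeing side by side; a minimal sketch (variable names are mine):

```python
# Sketch: the same 4-point change expressed against two denominators.
employed_before, employed_after = 91, 95      # percentages from the text
unemployed_before, unemployed_after = 9, 5

rise_in_employment = (employed_after - employed_before) / employed_after * 100
drop_in_unemployment = (unemployed_before - unemployed_after) / unemployed_before * 100

print(f"Employment rose {rise_in_employment:.1f}%")      # ~4.2%
print(f"Unemployment fell {drop_in_unemployment:.0f}%")  # ~44%
```

Same underlying change, an order-of-magnitude difference in rhetorical force, all from the choice of denominator.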


Table 14-8 | Success rates for dental implant placement for surgeons A and B

Surgeon | Success rate for smokers | No. of implants placed in smokers | Success rate for nonsmokers | No. of implants placed in nonsmokers | Overall success rate
A | 80% | 1000 | 95% | 100 | [(0.8 × 1000) + (0.95 × 100)] / 1100 = 81%
B | 70% | 100 | 85% | 1000 | [(0.7 × 100) + (0.85 × 1000)] / 1100 = 84%

Example 3: Novel indices and cutoff lines

A distressing finding of modern sociology is the ineffectiveness of many social programs. These programs are often very expensive, and justifying their existence on pragmatic (as opposed to moral or ethical) grounds is difficult. One approach to the problem is to devise new indices that make the results look better. Rosenthal and Rubin15 devised a simple procedure for converting an estimate of effect size into a tabular display (binomial effect size display [BESD]). Rather than making a statement like, "A special reading program accounts for only 9% of the total variation," we could say of the same data that it "reduced the need for tutoring by almost one half." A reader should always look carefully at novel indices, especially when arbitrary cutoff lines like "responders/nonresponders" or "needs/does not need tutors" are used. Are arbitrary definitions introduced to offer the investigators a means to conclude whatever they want?
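The BESD itself is a one-line transformation of the correlation coefficient. A minimal sketch under the standard Rosenthal and Rubin definition (the function name is mine):

```python
# Sketch: binomial effect size display (BESD) from "variance accounted for".
# With r^2 = 0.09 (9% of total variation), r = 0.3, and the BESD shows
# "success" rates of 65% vs 35%, the source of claims like "reduced the
# need for tutoring by almost one half".
import math

def besd(r_squared):
    r = math.sqrt(r_squared)
    treatment = 0.5 + r / 2   # proportion of "successes" in treated group
    control = 0.5 - r / 2     # proportion of "successes" in control group
    return treatment, control

treated, control = besd(0.09)
print(f"treated: {treated:.0%}, control: {control:.0%}")  # 65%, 35%
```

The display is legitimate, but the reader should notice how much more impressive "65% vs 35%" sounds than "9% of the variation."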

Example 4: Ratios of ratios of ratios of. . .

Sometimes investigators normalize data to form a ratio and then compare the ratios. The net effect of all this computation is that the reader ends up so far removed from the original data that the reader becomes confused and accepts the authors' conclusions. For example, a study on the effect of a drug on bone loss in an animal model of periodontal disease compared bone loss by measuring the length of the root surrounded by alveolar bone relative to the total length of the root to give a value of percent bone attachment. This ratio was computed at various times to give a rate that was a ratio of a ratio:

% bone loss / month

Then the drug-treated and control groups were compared to give an effectiveness ratio, which is a ratio of a ratio of a ratio:

rate of bone loss with drug treatment / rate of bone loss in control

In this composite value, it is virtually impossible to comprehend what the variation was in the original measurements, or even the actual amount of bone lost.

More on aggregated measurements

The Simpson paradox

A common statistical artifact called the Simpson paradox refers to the situation in which the aggregated data actually point in the opposite direction to that of the same data when disaggregated.16 Consider Table 14-8, which shows the success rates for smokers and nonsmokers for dental implants placed by Surgeons A and B. Both surgeons have lower success rates for smokers than for nonsmokers. For both smokers and nonsmokers, Surgeon A has a higher success rate, in fact 10% higher for each. Yet his overall success rate is less than Surgeon B's. The reason for the difference is that Surgeon B places more implants in nonsmokers than in smokers, whereas Surgeon A's patients are mainly smokers, whose implants are less likely to be successful. In this example, we could conclude that careful case selection (ie, limiting the number of smokers) by surgeons may be more important in perceived overall success rate than surgical skill. But the more general theme is that misleading conclusions can arise when one aggregates data across subclassifications (in this case, smokers and nonsmokers).
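The reversal is easy to verify from Table 14-8. A minimal sketch (the data structure and names are mine):

```python
# Sketch: the Simpson paradox numbers from Table 14-8.
surgeons = {
    "A": {"smokers": (0.80, 1000), "nonsmokers": (0.95, 100)},
    "B": {"smokers": (0.70, 100), "nonsmokers": (0.85, 1000)},
}

for name, groups in surgeons.items():
    successes = sum(rate * n for rate, n in groups.values())
    total = sum(n for _, n in groups.values())
    print(f"Surgeon {name}: overall {successes / total:.0%}")
# Surgeon A: overall 81%; Surgeon B: overall 84%. B "wins" overall even
# though A has the higher success rate within both subgroups.
```

The paradox disappears as soon as the case mix (the second element of each tuple) is reported alongside the rates.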

Subgroup analysis

The obverse of the problems posed by the Simpson paradox is the question of when it is legitimate to break down aggregated data into subgroups to determine whether a treatment had an effect on a subgroup. For example, consider a randomized controlled trial in which a drug was given to a group of men and women and its effects compared with those obtained in a placebo-treated group. Suppose that no statistically significant effect was found; however, close scrutiny of the data and separate examination of the responses of men and women might reveal that men benefited but women did not. The first question of interest is whether the results simply occurred by chance. If many possible subgroups (eg, sex, physiologic values) and many possible ways of comparing the subgroups exist, it is likely that some difference will arise by chance in at least one subgroup.

The most rigorous approach would be to test the finding in a subsequent experiment. A possibly acceptable approach would be to report the difference, particularly if there was a good explanation for the result that could have been postulated in advance (even though it was not). For example, in a study of the effects of flossing, an investigator could postulate in advance that flossing might be least effective for posterior teeth because they are hard to access. An investigator might not consider that possibility in advance of the experiment but, confronted by data showing that flossing improved oral hygiene markedly in anterior but not posterior teeth, might decide to report it as a statistically significant difference. An unacceptable approach would be to make as many comparisons as possible in the hope that one might turn out to be statistically significant; at the very least, the statistical test should be modified appropriately (such as with the Bonferroni correction) to account for the multiple comparisons being made, as sketched below.
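A minimal sketch of the Bonferroni correction with hypothetical p values; note how the nominally significant subgroup comparison fails the corrected threshold:

```python
# Sketch: Bonferroni correction for multiple subgroup comparisons.
alpha = 0.05
p_values = [0.04, 0.20, 0.35, 0.60, 0.75]   # hypothetical subgroup p values

threshold = alpha / len(p_values)            # Bonferroni: 0.05 / 5 = 0.01
for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < threshold else "not significant"
    print(f"comparison {i}: p = {p:.2f} -> {verdict} at corrected alpha = {threshold:.2f}")
```

The first comparison (p = 0.04) would have been declared significant at the uncorrected 0.05 level; after correction for five comparisons, it is not.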

Exploratory data analysis

Subgroup analysis, in which various relationships are considered, is a subset of the larger topic of exploratory data analysis. This analysis is probably as old as science itself, but its legitimization and enrichment are attributable to the statistician Tukey.17 In exploratory data analysis, an investigator examines the data and looks for patterns. In the breath analysis example given previously, the data were looked at in various ways, such as in ratios or as differences from baseline. As noted in the example, some of these operations can mislead the reader, but they can also provide new insights into the data and show patterns that may not be evident in unprocessed data.

Exploratory data analysis is one of the more enjoyable scientific activities, for it enables an investigator to gain new insights and typically involves less grunt work than the collection of data. Indeed, a scientist can even use other people's data, perhaps finding a pattern missed by the original investigators. Finding unexpected patterns is one of the major paths to discovery. As a simple example of exploratory data analysis, consider the data of Simonton,18 which record the average age at which scientists in various disciplines made their most important contribution. Simonton19 has published a figure on this (as well as other) data, in which the disciplines are ordered alphabetically (a much simplified version is given in Fig 14-4a). Alphabetical arrangement offers the reader the advantage of ease in finding particular data. However, it does not provide any insight into why the data are what they are, or whether some underlying principle or pattern might explain them.

Seeing the data in Fig 14-4a, and remembering vaguely Lord Bertrand Russell's comment that when he could no longer do mathematics he switched to philosophy, I began to wonder if that pattern also occurred in mathematically based disciplines; that is, is it the case that the more mathematics used in a discipline, the younger the scientists will be at the time of their most important contributions? First, I ranked the disciplines, according to my crude guess, as to how much each discipline, on average, used mathematics. I then redrew the data with the disciplines spread along the x-axis according to my guess and plotted the age at which the best contribution occurred for each (Fig 14-4b). A rough relationship is evident: the more mathematics involved in a discipline, the earlier the scientist "peaked." A serious study would require much more work; for example, an investigator might have to look at the work of individual scientists, their ages at the time of their most important contribution, how much mathematics was involved, and so forth. But conducting a preliminary exploration of the data and identifying a pattern provide the seeds for a future investigation.

Moreover, simply looking at the data in detail brings up other issues that they may inform. For example, at my university, there has been a heated debate on the issue of mandatory retirement at age 65. The administration, of course, wanted to make retirement mandatory at age 65, as this gives them flexibility in their planning and also saves costs (as old professors tend to be more expensive than freshly hired young ones). Faculty members, as a general rule, want mandatory retirement to be eliminated, as this gives them more options. Phrases like "distilled experience" are used to justify keeping the elderly professors, with all their distilled wisdom, on the payroll. When we look at the ordinate of Fig 14-4, however, we realize that for every science, the average age of scientists at the time of their best contributions is younger than 45 years. When we consider the SDs (not given here), it does not appear likely that professors will make their finest research contributions after age 65. Nevertheless, the faculty debate was truly academic; the Government of Canada in its wisdom decided that it was discrimination to force the aged to leave gainful employment at age 65. It will be of interest to determine the productivity and impact of the research of these old souls.

Fig 14-4a Average age at which scientists make their best contribution. Disciplines ordered alphabetically. (Data from Simonton.18)

Fig 14-4b Same data as Fig 14-4a. Disciplines are ordered by a crude assessment of the amount of mathematics involved in each field. (Data from Simonton.18)

Wainer20 makes three points on the use of graphs in exploratory data analysis: (1) impact is important; ideally, a graph should be vivid enough that its primary message is inescapable. (2) Understanding a graph is not automatic; a good legend can make a weak graph into a strong one. Wainer believes that a legend should state the point of the graph. Indeed, having to prepare informative legends may reduce the number of pointless graphs in a paper, as it makes the pointlessness of a graph evident. (3) A graph can make points that might not be evident otherwise.

Minimum requirements for reporting data

There are four minimum requirements for reporting numeric observations:

1. It should be clear exactly how the measurement was done.
2. There should be some measure of central tendency (eg, mean, median, or mode).
3. There should be some measure of variation (eg, SD, standard error [SE], confidence limits, or range).
4. There should be a statement about the total number of objects studied.

It is difficult to evaluate a study without all of this information. However, even when all these values are reported, the reader might not get an accurate impression of the distribution of the values in the sample. Statistics like the SD tell us little about the distribution of the values in the sample if the data are not distributed normally. Cleveland21 notes that the normal distribution is symmetric, but real data are often skewed to the right; the normal distribution does not have wild observations, but real data do. The best way to present some sets of data is to show all the data points.

Measurements of central tendency

The term population refers to a finite or infinite group of things with common characteristics. The sample is the observed part of the population. The purpose of any measure of central tendency of a sample is to represent a group of individual values. The best measure of central tendency depends on the situation being described; the most common measures are the mean, the mode, and the median.


Fig 14-5 Frequency density curve for a moderately skewed distribution in which the three measures of central tendency (mode, median, and mean) occur at different values. For a perfectly symmetrical distribution, they would all coincide.

Mean

The arithmetic mean is defined as:

mean = X̄ = (1/n) Σᵢ₌₁ⁿ xᵢ = (x₁ + x₂ + x₃ + . . . + xₙ) / n

where xᵢ = value of each measurement, and n = number in sample. Of the three measures of central tendency, only the mean uses all the actual numeric values of the observations. The mean uses more of the information in a set of data than either the mode or the median and is the value commonly chosen for data distributed according to the symmetric normal (Gaussian) distribution, in contrast to the median, which is more useful when the data distribution is skewed.

The problem with using the mean as an estimate of the average value is that the mean can be unduly influenced by extreme values. For instance, in salary negotiations with school boards, teachers sometimes become upset when the news media report their average salaries as indicated by the mean. The mean salary for professional employees of a school board is a value inflated by the high salaries of administrators and principals. Thus, teachers argue, the mean does not reflect the actual money paid to the teachers who face the children in the front lines of the classroom.

Mode

The mode is the most frequently occurring value in a set of measurements. It is most often used in discrete distributions (eg, counts in integers) but is also used with categorical data. In Fig 14-5, the mode corresponds to the peak of the distribution curve. Its use in biologic science is limited, but it can be thought of as a "typical" example. Although not affected by extreme values, the mode is easily shifted by the accidental accumulation of scores at some point that may be a considerable distance away from the central tendency of the distribution. Occasionally, samples will have two peaks in the frequency-distribution curve. Such a curve is termed bimodal, and the use of the mode can become problematic, as it may indicate a nonhomogenous data set. The data on death by horse kick (see Fig 10-7) provide an example in which the mode describes the distribution quite well: in most years, no one was killed, and this statistic is more readily digested than the parameters of the Poisson distribution. The mode is the only measure of central tendency that makes sense for nominal scale data. It makes sense to state that the modal (most frequent) sex of dental students is female and that their modal hair color is brown, but statements about their mean hair color or median sex would be meaningless.

Median

The median of a set of n observations is the middle value when the data are arranged in order of increasing size. If the sample has an even number of observations, the median is the average of the two central values. The median is unaffected by extreme values in the sample. The median is also useful because it can be calculated when some values are missing but are known to be above or below a certain level. For example, the median age of death for a group can be determined as soon as 50% of the members have died; calculation of the mean would have to wait until all the members had died. Thus, the median is particularly useful in reporting survival of patients or restorations.

Summary

Consider a sample that yielded the values 1, 2, 2, 3, 4, 5, 6:

Mean = (1 + 2 + 2 + 3 + 4 + 5 + 6) / 7 ≈ 3.29

Mode = 2 (two data points were 2; the rest of the data points occurred only once)


Median = 3

The relationship between the three measures of central tendency is shown in Fig 14-5.
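A quick check of these values, using only the Python standard library:

```python
# Sketch: the three measures of central tendency for the sample in the text.
from statistics import mean, median, mode

sample = [1, 2, 2, 3, 4, 5, 6]
print(mean(sample))    # 3.2857... (about 3.29)
print(mode(sample))    # 2, the most frequent value
print(median(sample))  # 3, the middle of the ordered values
```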

Measures of dispersion

Evaluation of results often concerns not only the average of a sample but also the dispersion, or spread, of the data around the central value. Once again, it is desirable for the value describing the data dispersion to be as representative of the sample as possible. We will examine several measures of dispersion.

Range

The range (R) is defined as R = xmax – xmin, where xmax and xmin are the highest and lowest values in the sample. The range is an efficient measure of dispersion when there are very few observations in the sample; for this reason, it has been widely used in quality-control work, where spot checks are made using small samples. The range can be applied to all types of data. Its problems are that it can vary greatly from one data set to the next and that it is badly affected by errors that produce outliers. It is generally regarded as a poor method for estimating population dispersion.

Standard deviation

SD, the square root of variance, is the most common measure of dispersion. The variance of a sample (s²) is defined as:

s² = Σᵢ₌₁ⁿ (xᵢ – X̄)² / (n – 1)

where X̄ = mean, and n = number in sample. The number of "independent" observations, represented by n – 1, is called degrees of freedom (df). The value for the df is one less than the total number of observations, because one df is lost when the mean is calculated. That is, if the mean and the values of n – 1 of the observations are known, the value of the nth observation can be calculated.

The variance for a population, rather than a sample from the population, is given by σ², where µ is the true population mean:

σ² = Σᵢ₌₁ⁿ (xᵢ – µ)² / n

For a sample from the population, the applicable formula for the SD (which is the square root of the variance) is:

s = √[Σᵢ₌₁ⁿ (xᵢ – X̄)² / (n – 1)]
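The distinction between the n – 1 and n denominators is easy to verify numerically. A minimal sketch, reusing the sample from the summary above:

```python
# Sketch: sample variance (n - 1 denominator) vs population variance (n).
sample = [1, 2, 2, 3, 4, 5, 6]
n = len(sample)
xbar = sum(sample) / n

ss = sum((x - xbar) ** 2 for x in sample)   # sum of squared deviations
s2 = ss / (n - 1)                           # sample variance, df = n - 1
sigma2 = ss / n                             # population variance
print(s2 ** 0.5, sigma2 ** 0.5)             # sample SD vs population SD
```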

Interquartile range

Just as the median divides the data into halves, the quartiles divide them into quarters. The interquartile range (IQR) normally refers to the range between the upper and lower quartiles, ie, the middle 50%. It is employed in the presentation of box plots. If the underlying distribution is Gaussian, the IQR can also be used to estimate the SD roughly, according to the formula:

s ≈ IQR / 1.35

Standard error

A common technique for portraying sample-to-sample variation of a statistic is to graph error bars portraying plus and minus one SE of the statistic, in a way similar to that used when the sample SD is used to summarize the variation of the data. The formula for the standard error of the mean (SEM), s_X̄, follows:

s_X̄ = s / √n

An SE of a statistic conveys information about the confidence intervals of the statistic; the mean plus or minus one SE is approximately a 68% confidence interval. Thus, the SEM quantifies uncertainty in the estimate of the mean. Variability in the population, the type of information in which the reader is interested, is not directly demonstrated. Thus, Glantz22 recommends that data should never be summarized with the SEM. What makes the SE attractive to many authors, however, is that it is smaller than the SD; thus, the appearance of variability in the data is reduced because the error bars are smaller.
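Glantz's point is easy to demonstrate: for the same data, the SEM is smaller than the SD by a factor of √n. A minimal sketch (the data are the illustrative sample used earlier):

```python
# Sketch: why SEM error bars look "better" than SD error bars.
data = [1, 2, 2, 3, 4, 5, 6]
n = len(data)
xbar = sum(data) / n
s = (sum((x - xbar) ** 2 for x in data) / (n - 1)) ** 0.5
sem = s / n ** 0.5
print(f"mean = {xbar:.2f}, SD = {s:.2f}, SEM = {sem:.2f}")
# With n = 7 the SEM is already about 2.6 times smaller than the SD,
# and the gap widens as n grows.
```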


Significant Digits: Conventions in Reporting Statistical Data

Published reports often appear to have a spurious degree of precision. Different authorities recommend different conventions. Some believe that, as a general rule, investigators should not pretend to know more than they really do; in this view, they should not report any more digits than can be read, or at least estimated, from the scale that was read. However, the precision of a measurement is given by its SD; precision is not indicated by the number of significant digits in the result. The actual number of digits reported is to some extent a question of style.

Mainland23 suggests carrying two more figures than will be required at the end. In the analysis of data by ANOVA techniques, Pearce24 recommends expressing the data so that there are three digits that vary from one value to another. For example, data could take the form 1.413, 1.536, 1.440, etc, in one case, and 101.413, 101.536, 101.440, etc, in another; in both cases, it is the last three digits that vary among the values. In the ANOVA technique, Pearce advises expressing the summation terms to two decimal places beyond those arising from the act of squaring. Finally, Gilbert25 recommends reporting one more decimal place in the SE than in the mean itself, making it possible to be accurate in subsequent calculations (such as confidence limits). However, Gilbert also states that, in biology, it is rarely worth carrying more than three significant digits in any mean. Few would disagree with Gilbert's declaration that if someone claims a duck lays 4.603 eggs on average, the final digit, 3, is almost certainly useless.

Tables

Tables are used to present exact values of numeric data when the amount of data is too extensive to be summarized in the text. Huth26 recommends that there should be no more than one table or illustration per thousand words of text, and that tables should be used when readers need the exact values of more data than could be summarized in a few sentences of text. Overuse of tables, figures, and illustrations is often symptomatic of an author's desire to disguise dross as gold. Day27 remarks that:

Many authors, especially those who are still beginners, think that a table, graph, or chart somehow adds importance to the data.

Thus, in a search for credibility, there is a tendency to convert a few data elements into an impressive-looking graph or table. Several commonsense rules are applicable28:

1. When four or more items of statistical information or data are to be presented, the material will be clearer in tabular form.
2. Trends in the data should be exploited to coincide with reader expectations. Time series and concentrations are examples of data where readers expect the earliest or smallest values first and the latest or largest values last. In some instances, authors design tables based on ease of locating particular data entries; for example, if United States income levels were reported by state, the first row would cover Alabama, and the last row, Wyoming. However, as noted by Wainer,29 such a presentation may prevent readers from discerning important patterns.
3. The data should be arranged so that the major comparisons are clear.
4. Inessential data should be omitted.

More detailed instructions on tabular displays of data and statistics are given in chapter 20 of Lang and Secic.30

Data tables can be used deceptively. Occasionally, tables are used where the reader might expect a figure or histogram. On examining the table, the reader sometimes finds an irregular point in the data that does not conform to the general pattern. In general, it is much easier to detect irregularities in a graph or histogram than in sets of written numbers. Hence, tables must be examined closely to check internal consistency.

Illustrations

Illustrations, such as photographs, are necessary when they are the evidence being offered to support a conclusion. Ultrastructural studies provide an example; the electron micrographs are the evidence. In other instances, diagrams or charts summarize and demonstrate the relationships between groups. The pioneer statistician Fisher31 viewed diagrams as follows:


Diagrams prove nothing but bring outstanding features readily to the eye; they are therefore no substitute for such critical tests as may be applied to the data but are valuable in suggesting such tests.

This view was also held by Hill, who wrote, "Graphs should always be regarded as subsidiary aids to the intelligence and not as evidence of associations or trends."32 Figures and illustrations provide information efficiently and, by virtue of their prominence relative to the material presented in the text, automatically emphasize the points being made.

Understanding presentation techniques

A number of techniques are available to slant illustrations so that they provide support for the author's views. Several of these are graphic analogues of rhetorical techniques.

Selection

The general rule with selection seems to be that authors do not show what they do not want the reader to see. "Warts and all" is not a frequently employed reporting strategy. By selecting the illustrations, authors direct and focus attention on particular aspects of the study. To a certain extent, such focusing is inevitable, for not every observation or detail in a study will be published. The problem arises when the information presented is selectively biased toward a particular interpretation, or when the selection makes the technical aspects of a study appear better than they really are. On a technical level, microscopic fields that exhibit prominent debris or precipitates are seldom shown, as they might lead the reader to think that techniques were applied sloppily. Authors often choose examples that exaggerate effects, so what is presented is not necessarily representative. For example, if the authors claim that a high percentage of cells stain with a particular antibody, but the actual percentage of stained cells varies from field to field, they would generally choose for publication a field that most strongly demonstrates the conclusion.

Arrows

Typically, authors insert arrows or other markers into micrographs. This stratagem has two strengths: (1) it directs attention to the features that figure most strongly in the authors' interpretation, and (2) the marker itself might obscure some feature that the authors do not want to be seen.

Providing context

Authors may or may not provide context for their illustrations. A useful strategy for photomicrographs is to include a figure at lower magnification, with the area containing the feature of interest shown at higher magnification as an inset or separate figure. In this way, the "big picture," which will necessarily be more representative, can be given along with a detailed view of the feature of interest.

Another means of providing context is the scale or micron bar printed directly on a micrograph. Some journals allow authors to place the magnifications in the figure legend instead. In theory, either method enables the reader to measure the size of features in a micrograph. However, if only magnifications are given, a reader must measure the feature of interest on the micrograph and divide by the magnification to determine its size. That takes some mental effort, and readers, being cognitive misers, are unlikely to make it. So, if authors have something whose size is not what it should be, they would be wise to eschew scale bars and, in their stead, report magnifications.

Context for figures reporting quantitative data is given by the controls. For instance, if the effects of a drug are being studied, the negative control (no drug) will show the basal response of the assay system, and the positive control (eg, a drug known to maximally affect the response) will demonstrate the possible responsiveness of the assay system. Authors may manipulate their choice of controls to suit their interpretive needs. A study touting the effects of a new mouthrinse on breath odor might compare the new product, not with the best available treatment, but rather with a well-known product of marginal effectiveness.

"Persuading with pap"

This phrase, taken from Monmonier's How to Lie with Maps,33 refers to the practice of using highly simplistic maps or, alternately, maps with irrelevant minutiae; the former persuade readers by reducing complexity, while the latter obscure important or inconvenient points by burying them in a mass of details. In science, schematic diagrams or cartoons suggesting mechanisms or relationships can perform this role. Modern biologic and clinical studies are often complex, and interpretations are made based on relationships of varying strength. Readers, who are almost always cognitive misers, like authors to do their thinking for them, so they welcome simplified diagrams that make vague concepts concrete. However, a weak set of data that just reaches statistical significance but shows only a small effect size may become a solid arrow indicating a firm cause-effect relationship in a schematic diagram of mechanism. Similarly, a cartoon illustrating the findings of a study on a drug affecting cell signaling might feature a well-established fact (eg, the structure of a G-linked receptor, with its seven transmembrane components) casting a halo over the less well-established structures and relationships that were actually studied.

Leading the reader

Leading the reader is a particularly useful tactic in figure legends, where authors will effectively tell the reader what to think; that is, they will mix interpretation with the description of the observations. Being cognitive misers, readers are often not averse to being led. Figure legends are expected to be brief, so when interpretation is placed in legends, qualifying words and caveats can be left out without attracting undue attention. Leading the reader is essential in achieving the persuasive intentions of authors; if an interpretation is not given in the figure legend, readers will have to figure it out themselves. In such a situation, three eventualities are possible (two of which, from the author's point of view, should be avoided): (1) readers will stumble on the same conclusion as the authors, (2) they will reach a different interpretation, or (3) they will become confused.

The best strategies and uses of graphics

The best strategy is to lay out the figure in such a manner that the explanation or interpretations are self-evident. Tufte has given the principles of how this can be done, as well as numerous examples of where it has or has not been done, in a series of insightful books.1,34–36 The style of the Tufte books reminds me of that of Marshall McLuhan, the late University of Toronto professor and media guru, particularly his book The Mechanical Bride, wherein each section is a case study packed with insight. In the view of Grady,2 Tufte's books are not organized around a clear line of analysis but rather resemble a series of meditations on three themes:

1. A focus on data that can be quantified.
2. A concern with the ways in which cognition and esthetics are entwined; the principles of visual legibility applied to a problem are a necessary component of clear thinking.
3. A commitment to policy-driven applications.

For example, Tufte analyzed the engineers' attempts to document the history of O-ring damage on the Space Shuttle Challenger that led to the death of seven astronauts. Although relevant data were presented, they were presented in such a way that the most important element, the effect of temperature on O-ring damage, was not immediately evident.

Authors of scientific papers traditionally use graphics to document their arguments and conclusions. Tufte urges readers to clarify their thinking by using graphics as an exploratory tool in understanding their data. In general, Tufte believes that the most effective illustration combines the direct visual evidence of images with the explanatory power of honestly conceived diagrams. An example is Tufte's examination36 (pages 27–37) of the famous story of how John Snow halted the spread of cholera in London in 1854. There were numerous pieces in the solution of the puzzle, but the key one was that Snow removed the handle of the Broad Street pump that appeared to be the source of contamination. Snow's biographer stated it simply: "the pump handle was removed and the plague was stayed." Tufte replotted the data of deaths with time, showed that the death rate from cholera had in fact started falling some five days before Snow removed the pump handle, and directs readers to consider some of the other contributors to the staying of the plague.

Tufte has developed a number of useful principles that have been widely adopted. Only a few of the many important or interesting applications are noted here:

• Avoid chartjunk. Chartjunk is unnecessary decoration added to a figure, such as the intimidating male and downtrodden female (see Fig 14-11). Often chartjunk is added to fill in space when the data themselves are thin. Tufte notes that chartjunk betrays the author's contempt for both the information and the audience; credibility vanishes in clouds of chartjunk. Cartographers in the age of discovery had many gaps in their knowledge of geography and sometimes filled those gaps with fanciful illustrations of natives, elephants, depictions of the four winds, and monsters. These creatures have their charm, but they do not transmit useful information.


• Layering and separation. One example of layering is the use of different colors to represent different elements. The maps of rapid transit systems, such as the London Underground, are stellar examples of the effective transmittal of information; to achieve that state, there was a constant evolution in map design.37 The various lines are denoted by different colors. Early versions showed geographic features, such as rivers and hills, along with the locations of the lines and stations; but as the system expanded, the map became too complex, and the information most relevant to the passenger, such as the sequence of the stations along a line, was difficult to discern. Realistic geographic features were then removed, and the direction of the lines and their relationships to each other were emphasized. Once the limitations of conforming to the natural topography were removed, it was possible to optimize the functionality of the map. To use the available space effectively, the angles between the lines and individual stations were standardized and no longer reflect the actual geographic relationships. With the increase in useful space, the typography could be adjusted to increase legibility.

• Small multiples. In this technique, a visual template is used repeatedly. For example, in a medical chart for hospital use, various patient properties, such as glucose level, drug concentrations, blood pressure, CO2, and temperature, are plotted with time in a standard format. As the units differ, key values for each property are indicated: critically elevated or critically reduced, elevated, reduced, and normal range. The time axis includes data more than 1 year prior to admission, 1 year prior to admission, and then daily after admission. The clinician can readily discern the patient's current condition, as all the data are immediately accessible on a single sheet rather than scattered through various binders.

• Parallelism. Strunk and White38 list parallelism as one of their key principles of prose composition, the principle of parallel construction: express coordinate ideas in similar form. The eight beatitudes exhibit this virtue, all having the form "Blessed are the . . . for they shall be. . . ." By this rule, an article or preposition applying to all members of a series must be used only before the first member of a list or before all the members of a list. Tufte extends this principle to visual explanations, stating, for example, that multiple parallelism is a natural design strategy for music and sound. His illustration36 shows the 88 keys of a piano labeled by note at the bottom, above which are the frequency of the sound, then the musical annotation, then the ranges of human singing voices, then the ranges of musical instruments. Parallelism also excels in storytelling examples, such as the story of marketing trends and stylistic developments, where the x-axis is time, the y-axis is sales, and the various styles are shown as streams in the pop/rock river. I was mortified to learn that some of the favorites of my youth were classified under schlock rock, so much so that I considered ditching my CD of one-hit wonders. In any case, as with a stacked bar graph, it is difficult to estimate exactly the sales of each genre; however, the artists and their relative popularity are readily evident.

The book jacket of Tufte's Beautiful Evidence shows a sequence of photos of a golden retriever jumping into the water; such sequences are powerful because they show actual phenomena that sometimes elude impactful description. One of the staples of my scientific presentations on the effects of surface topography on cell behavior was time-lapse films showing cells migrating on various topographies. The cells could be shown to move along grooves in one direction and then turn 90 degrees when they arrived at sets of grooves at right angles to the first set: convincing evidence of how surface structure can guide cell migration. Advertisers are well aware of the power of graphic storytelling. A common strategy in advertising is to demonstrate a problem (such as a streaky window), have one actor suggest to the other a way of resolving the problem (a special cleaner), demonstrate success, and celebrate the result. Figure 14-6 shows my grandson, Calixte Brunette, age 16 months, solving the problem of undoing a hook and eye latch on the gate that bars his way to the great upstairs. Four pictures tell the story. As the ancient Romans used to say, res ipsa loquitur: "the thing speaks for itself." I don't have to tell my friends the child is a genius; they tell me, a much more satisfactory (and less blatant) means of having the message delivered.

Fig 14-6 Sequence of Calixte Brunette, age 16 months, solving the problem of the hook and eye latch (panels: problem, proposed resolution, success, celebration). Taken from an iPhone video courtesy of Malizza Brunette.

Evaluating graphs: Tufte's evaluative ratios

Graphs are often used in place of tables when a pronounced trend or relationship between the variables is plotted. Appearance and clarity are important in determining a figure's effectiveness. Guidelines for the visual presentation of charts and graphs in the life sciences have been published by Simmonds and Bragg,39 in association with the Institute of Medical and Biological Illustration, as well as by Tufte, in his excellent and entertaining book, The Visual Display of Quantitative Information.34 Tufte has devised several indices to evaluate illustrations; the first two deal largely with the efficiency of information presentation.

Data-ink ratio

The data-ink ratio (DIR) is the proportion of the graphic's ink devoted to the nonredundant display of data:

DIR = data ink / total ink used to print the graphic

Tufte advises maximizing the DIR within reason; the closer the ratio is to 1.0 the better. In practice, this often means eliminating grids, plotting points boldly, and erasing redundant data ink or unnecessary nondata ink, such as redundant labels. For example, there is often no point in presenting both halves of symmetric measures, such as error bars.

Data density index

The data density index (DDI) of a graphic measures the amount of data displayed in relation to the graphic's size:

DDI = no. of entries in data matrix / area of graphic

Tufte surveyed a number of scientific journals and computed their median DDI: Nature scored 48; Science, 21; New England Journal of Medicine, 12; and Scientific American, 5. He concluded that the average published graphic is pretty thin; that is, it does not illustrate much data for the area it occupies (see Fig 14-11). It will be interesting to see how the advent of electronic publishing affects data density. In standard print journals, the cost of publication is related to the length of the articles, and some journals, such as Quintessence's International Journal of Oral and Maxillofacial Implants, ask referees whether any figures or illustrations can be omitted. For electronic publications, however, article length is not a significant factor in cost, and editors will probably be less vigilant in identifying pointless figures.

Lie factor

The lie factor (LF) quantifies the distortion present in some figures and is defined as follows:

LF = size of effect shown in graphic / size of effect in data

The LF will be illustrated later.
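All three indices reduce to one-line calculations. The sketch below wraps them as functions (the function and parameter names are mine, not Tufte's):

```python
# Sketch: Tufte's evaluative ratios as simple functions.

def data_ink_ratio(data_ink, total_ink):
    """Proportion of a graphic's ink devoted to nonredundant data display."""
    return data_ink / total_ink

def data_density_index(n_entries, area):
    """Entries in the data matrix per unit area of the graphic."""
    return n_entries / area

def lie_factor(effect_in_graphic, effect_in_data):
    """Values far from 1.0 signal visual distortion."""
    return effect_in_graphic / effect_in_data

print(data_density_index(3, 20))  # 0.15, the Fig 14-11 example discussed below
```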

Evaluating graphics: Cleveland's hierarchy of graphical perception

By investigating the perception of graphs from the theory of visual perception and by performing experiments in graphical perception, Cleveland40 was able to assess the accuracies with which readers perform graphical perception tasks. For information to be interpreted most accurately by the reader, data should be displayed so that the reader uses the most accurate processes of graphical perception. The following list presents Cleveland's hierarchy40 of perception, from the most accurate to the least accurate:


1. Position along a common scale
2. Position along identical, nonaligned scales
3. Length
4. Angle/slope (Cleveland's data cannot discriminate between these two.)
5. Area
6. Volume
7. Color hue, color saturation, color density

Fig 14-7 The grade data for four students are presented in two graphs to demonstrate Cleveland's hierarchy of perception. (a) Stacked bar graph. (b) Bar graph with a common scale; in this configuration, a fourth bar (total grade) could be added for each student, if desired.

As an illustration of Cleveland's hierarchy, consider Fig 14-7. A student's grade in my undergraduate course ORBI 430 was calculated as the sum of the marks on a test, problem sets, and an oral examination. Figure 14-7a presents the data for four students as a stacked bar graph. Because the oral component has a common baseline, you can compare the students' oral grades directly. Comparing the problem-set and test values is difficult, however, because it involves length judgments (ranked 3 in the Cleveland hierarchy), which are not as accurate as position judgments along a common scale (ranked 1). This information could be presented as a bar graph with a common scale (as shown in Fig 14-7b), which makes it possible to compare all components on
a common scale and, consequently, to process the information more accurately. Cleveland believes that it is never necessary to resort to a divided bar chart, because any set of data that can be shown by a divided bar chart can also be shown by a graphical method that replaces length judgments with position judgments.

Although bar graphs have the advantage of comparing data along a common scale, there is a potential problem in the manner in which the bars are filled. The difference in contrast between the hatchings should not be too dramatic, because the eye is drawn to black areas, making them difficult to compare with unshaded areas.

As another example of applying Cleveland's hierarchy, consider the distribution of UBC expenditures, presented in Fig 14-8a as a pie chart. Comparing the various components requires the reader to make angle judgments, a task rated fourth in the Cleveland hierarchy; angles cannot be estimated as accurately as positions along a common scale. Thus, Fig 14-8b, a bar chart presenting the same data, enables comparisons between the categories to be made both more accurately and more conveniently than the pie chart.

Fig 14-8 The distribution of UBC expenditures is presented in two forms to demonstrate the Cleveland hierarchy of perception. (a) Pie chart. (b) Bar graph.

Although often derided by statisticians and graphic designers, pie charts have been shown by Simken and Hastie41 to be effective in certain situations. Pie charts are not widely used in scientific graphics but are common in advertising and other mass media applications, where they are often tilted and given depth so that they appear like floating platters; these additional manipulations make them even more difficult to interpret.

Students sometimes confuse bar graphs and histograms. A bar graph has spaces between the bars and is used when the bars represent discrete factors, such as exposure to some treatment or data obtained from different species. A histogram has no spaces between the individual columns and is used when the columns represent a variable that can be varied continuously. Histograms reveal the distribution of a variable.
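To see the difference between length and position judgments directly, the two panel types of Fig 14-7 can be mocked up. The sketch below uses matplotlib with made-up grades (the actual values behind Fig 14-7 are not given in the text):

```python
# Sketch: stacked vs grouped bar graphs for hypothetical student grades.
# The grouped version lets every component be judged as a position along
# a common scale (rank 1 in Cleveland's hierarchy) rather than as a length.
import matplotlib.pyplot as plt
import numpy as np

students = ["Jones", "Yu", "Wu", "Smith"]
oral = np.array([15, 18, 20, 22])        # illustrative values only
problems = np.array([20, 25, 22, 28])
test = np.array([30, 35, 40, 45])

x = np.arange(len(students))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Stacked bars: only the bottom segment sits on a common baseline
ax1.bar(x, oral, label="Oral")
ax1.bar(x, problems, bottom=oral, label="Problems")
ax1.bar(x, test, bottom=oral + problems, label="Test")
ax1.set_xticks(x)
ax1.set_xticklabels(students)
ax1.set_title("Stacked (length judgments)")

# Grouped bars: every component is read against the common y-axis
w = 0.25
ax2.bar(x - w, oral, w, label="Oral")
ax2.bar(x, problems, w, label="Problems")
ax2.bar(x + w, test, w, label="Test")
ax2.set_xticks(x)
ax2.set_xticklabels(students)
ax2.set_title("Grouped (position judgments)")

ax2.legend()
plt.tight_layout()
plt.show()
```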

The art of deception as applied to graphs and figures

Good graphics display data accurately and clearly. With the advent of computer graphics, a large amount of data can be presented clearly in a small amount of space. However, a clear, accurate representation of all the data does not always serve authors' interests in demonstrating their arguments, and some authors, knowingly or unknowingly, stoop to the art of deception. Our concern in this section is how graphs can be presented so that they appear to strengthen arguments. Elements of the art of deception have been practiced for some time: Huff's42 1954 classic How to Lie with Statistics dealt with the topic and has been updated by the more recent How to Display Data Badly by Wainer.43 Cleveland's40 landmark book The Elements of Graphing Data presents the current standards and rationale for presenting data clearly.

Use of scales

Fig 14-9a Plot of BC's expenditures on dentists. This graph uses large units on the ordinate.
Fig 14-9b Plot of BC's expenditures on dentists. This graph has obvious extrapolation difficulties.
Fig 14-9c Log of BC's expenditures on dentists.
Fig 14-9d Plot of BC's expenditures on dentists (circles) and physicians (triangles).

Consider the graphs presented in Fig 14-9, which plot the expenditures of the province of British Columbia (BC) on dentists. The increase in expenditure seems more rapid in Fig 14-9b than in Fig 14-9a, solely because of the scale of presentation. It should be pointed out that Fig 14-9a is a poor graph because the scale units on the ordinate are too large (a general rule of thumb is that scales should be selected so that the curve extends over roughly two-thirds of the range of both ordinate and abscissa). Cleveland (quoted in Wainer29) advises that scales be chosen so that the data fill as much of the data region (ie, the area formed by the scale rectangle) as possible but insists that authors
should always be willing to forego this fill principle to achieve an effective comparison. In particular, comparing a number of panels in a single figure will generally work better if all panels have the same scale, so that comparisons can be made between panels even if the individual panels are not filled.

Fig 14-9b is clearly better, but it also has problems. Using this curve makes it difficult to predict what the expenditure will be in the future, because the form of the curve is difficult to extrapolate. This can be solved by plotting the log of expenditures, as shown in Fig 14-9c. Because this curve is fitted to a smooth curve, it is easier to extrapolate. Of course, such an extrapolation will not necessarily yield the right prediction, because conditions might develop that destroy the assumption that the past pattern applies to the future; for instance, the onset of a recession might substantially affect expenditures on dentistry.

Logarithmic scales are useful when it is important to understand percent change or multiplicative factors.

The most common is a semi-log plot, in which the log of the dependent variable is plotted against a linear scale. On a semi-log plot, percentages and factors are easier to judge because equal multiplicative factors and percentages result in equal distances throughout the entire scale. That is, when the slope of a semi-log plot is straight, the rate of relative change is constant. Semi-log plots are often used in growth curves, where cell number is increasing exponentially, and the plot of cell number with time on a semi-log plot will be linear. Similarly, dose-response curves can be plotted using a semi-log plot because dose can be logarithmically related to the number of receptors stimulated. Like any curve-fitting procedure, the presence of a straight line does not uniquely identify any particular process, so the straight line semi-log plot found for expenditures in dentistry could result from (1) dentists increasing their fee schedule by the same percentage per year while the number of procedures remains constant; (2) dentists charging the same for procedures while the number of


procedures grows by the same rate every year; or (3) a combination of such factors. In any case, each of the three curves gives a different impression about consumer spending. Figure 14-9a could be used by a dental association to show that consumer spending on dentistry is rising slowly; Fig 14-9b, by a consumer association to prove that dental expenditures are rising too rapidly; and Fig 14-9c, by the government to estimate the taxes they will receive from dentists. Scales can be used to affect the perception of variation in individual data points. A log scale, for example, will compress the apparent amount of variation. An advantage of taking logs is that the overlap caused by positively skewed data will often be alleviated, and the resolution of the graph will be improved. Thus, there are many valid reasons why data should be transformed and various rules that should be applied.44 In summary, the scale used affects our perception of the data. If a reader is suspicious about a scale, the best thing to do is to replot the data in a more suitable form and see how it looks.
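Replotting suspicious data takes only a few lines of code. The sketch below is our own illustration, with made-up numbers rather than BC's actual figures: a series growing at a constant 12% per year is plotted on both a linear and a semi-log scale, and only the latter yields a straight line.

# Replotting data on linear and semi-log scales (illustrative values only).
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(1974, 1987)
spending = 100 * 1.12 ** (years - 1974)  # $ millions, constant 12% annual growth

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(8, 3))
ax_lin.plot(years, spending, "o-")
ax_lin.set(title="Linear scale", xlabel="Year", ylabel="$ (millions)")

ax_log.semilogy(years, spending, "o-")  # constant % growth plots as a straight line
ax_log.set(title="Semi-log scale", xlabel="Year", ylabel="$ (millions)")
plt.tight_layout()
plt.show()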

Suppress the baseline

Another trick in the art of deception (used in histograms as well as graphs) is simply to exclude the zero value of the ordinate (Fig 14-10) and to expand the scale. An obvious advantage for the author is that baseline suppression magnifies any differences between the groups. Moreover, including zero in every graph makes no sense if it makes the ordinate axis so large that the resolution of the graph is destroyed; thus, zero should be included in the scale if it does not waste undue space or seriously degrade the comparisons being made. If zero is not included, a break in the ordinate scale or a note in the legend should indicate its absence. The common practice is to have the scale markers and grid lines placed at regular intervals (such as 10, 20, 30, etc) because that practice aids interpolation. However, Wainer29 commented on William Playfair's widely admired graph of the English National Debt in the period of 1688 to 1786 that "Playfair's genius was in the placing of gridlines at points of interest based on the data and labeling them" (such as 1739, the beginning of the Spanish War) so that "answers are provided for the natural questions that anyone looking at the chart would ask."

Hype a small amount of data

Figure 14-11, redrawn from the feminist magazine A Room of One's Own,45 illustrates the tactic of data hype. The amount of information presented is small, comprising only six pieces of data in an area of approximately 20 cm²: the percentage of individual grants awarded to men and women, the percentage of grant monies awarded to men and women, and the amounts of money awarded to men and women. Moreover, because, for either grants or money, the percentages allocated to men and women must add up to 100%, there are actually only three pieces of information: the total amount of money, the ratio of money granted to men and women, and the ratio of men to women receiving grants. Tufte's data density index (DDI) can be calculated as follows:

DDI = 3 / 20 = 0.15

Wainer43 reported that the DDI for popular and technical media ranged from 0.1 to 362. Clearly, this illustration falls in the lowest stratum.

Hiding the data

Omit essential data
Figure 14-11 also omits important data. To assess these data, you need to know the distribution of men and women applying for grants. If equal numbers of men and women applied, there would be reason to consider that there might be some discrimination against women. But if the ratio of men to women applying for money was 2:1, there would be no evidence for bias.

Odious comparisons
Most observers would agree that Fig 14-9a looks wrong because there is too much blank space. Readers might question the scale and figure out that the chosen scale makes the increases look small. To banish such thoughts, the author might include another line on the graph. In Fig 14-9d, data on spending on physicians have been added to the graph. The scale is legitimized by this second curve. The expenditure on dentists' services looks small relative to the increased expenditures on physicians.

Using areas or volumes to represent unidimensional data

Figure 14-11 also illustrates this principle.


Fig 14-10a Bar graph with zero value included in ordinate (income). Occupations compared: professors and physicians.
Fig 14-10b Bar graph without zero value in ordinate.
Fig 14-11 panels: individual grants awarded by gender (men 67%, women 33%); individual grant monies by gender ($2,212,300 [70%] to men, $960,648 [30%] to women).

Note that the two parts of the illustration cite percentages, a linear quantity. Yet, these percent values are represented by pictures of dollar bills and people. Because the width of the dollar bill is proportional to its height, the original ratio of $ male/$ female of 7:3 appears, by comparing areas, to be 6:1. We can calculate the lie factor, Tufte's measure of distortion, as follows:

LF = (size of effect shown in graphic) / (size of effect in data)

Here the size of effect shown in the graphic is 6, and the size of effect in the data is 7/3 ≈ 2.3, giving LF = 6/2.3 ≈ 2.6. Lie factors greater than 1.05 or less than 0.95 indicate substantial distortion, far beyond minor inaccuracies in plotting.34 However, Cleveland40 has found that when people are shown two areas whose magnitudes are a1 and a2 and are asked to judge the ratio of a1 to a2, most will judge the ratio as (a1/a2)^β,

Fig 14-10c 3D bar graph without zero value in ordinate.
Fig 14-11 Example of magnifying a small amount of data. Total number of grants awarded to individuals in all categories by theatre section of the Canada Council between 1972 and 1981 was 996. Total monies disbursed was $3,172,648. (Redrawn with permission from Fraticelli.45)

where β would typically have a value of 0.7. Thus, in this example, the perceived size of the effect in the graphic equals 6^0.7 ≈ 3.5, and the perceived lie factor is closer to 3.5/2.3 ≈ 1.5, which still indicates substantial distortion. Until recently, the use of 3D figures to illustrate unidimensional values in scientific presentations has been infrequent. However, many computer graphic programs now offer the option of adding depth to bar graphs. Figure 14-10c shows how this tactic can be used to hype the professor/physician comparison by presenting volumes to magnify the effect. It is difficult to determine the actual values of these 3D bar


graphs. In SYGRAPH: The System for Graphics, Wilkinson46 describes the addition of a third dimension as “perspective mania,” and notes that he cannot think of a single instance in which a perspective bar graph should be used. However, the SYGRAPH application offers a full assortment of 3D graphs. Why? “We had to make some concession to the marketplace.”46 Thus, it appears that some users of SYGRAPH, despite explicit advice to do otherwise, produce misleading graphics. Such behavior, like other examples in this chapter, may be further evidence for the doctrine of original sin.
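The lie-factor arithmetic discussed above is easy to verify directly. The following is a minimal sketch (the function names are our own, not Tufte's or Cleveland's) using the Fig 14-11 numbers.

# Lie factor and perceived area ratio for the Fig 14-11 example.
def lie_factor(effect_in_graphic: float, effect_in_data: float) -> float:
    """Tufte's lie factor: size of effect shown / size of effect in data."""
    return effect_in_graphic / effect_in_data

def perceived_area_ratio(a1_over_a2: float, beta: float = 0.7) -> float:
    """Cleveland's finding: viewers judge an area ratio a1/a2 as (a1/a2)**beta."""
    return a1_over_a2 ** beta

effect_in_data = 7 / 3         # the true $ male / $ female ratio, ~2.3
effect_in_graphic = 6.0        # the apparent area ratio in the drawing

print(lie_factor(effect_in_graphic, effect_in_data))   # ~2.6: substantial distortion
perceived = perceived_area_ratio(effect_in_graphic)    # 6**0.7 ~ 3.5
print(lie_factor(perceived, effect_in_data))           # ~1.5: still distorted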

References

1. Tufte ER. Beautiful Evidence. Cheshire: Graphics, 2006:9.
2. Grady J. Edward Tufte and the promise of visual social science. In: Pauwels K (ed). Visual Cultures of Science: Rethinking Representational Practices in Knowledge Building and Science Communication. Hanover: Dartmouth College, 2006.
3. Tufte ER. Beautiful Evidence. Cheshire: Graphics, 2006.
4. Kemp M. A perfect and faithful record: Mind and body in medical photography before 1900. In: Thomas A, Braun M (eds). Beauty of Another Order: Photography in Science. New Haven: Yale University, in association with the National Gallery of Canada, 1997:120–149.
5. Judson HF. The Great Betrayal: Fraud in Science. Orlando: Harcourt, 2004:61–64.
6. Godfrey E. By Reason of Doubt: The Belshaw Case. Halifax, Nova Scotia: Goodread Biographies, 1984.
7. Medawar P. The Strange Case of Spotted Mice: And Other Classic Essays on Science. Oxford: Oxford University, 1996:132–143.
8. Gould SJ. The Mismeasure of Man. New York: WW Norton, 1981:92.
9. Purves D, Williams SM. Neuroscience, ed 2. Sunderland, MA: Sinauer Associates, 2001.
10. Adams R. Beauty in Photography, ed 2. New York: Aperture, 1996:68.
11. Schaaf LJ. Invention and discovery: First images. In: Thomas A (ed). Beauty of Another Order: Photography in Science. New Haven: Yale University, in association with the National Gallery of Canada, 1997:26–59.
12. Cazort M. Photography's illustrative ancestors: The printed image. In: Thomas A (ed). Beauty of Another Order: Photography in Science. New Haven: Yale University, in association with the National Gallery of Canada, 1997:14–25.
13. Lawson D. Photomicrography. London: Academic, 1972:310–313.
14. Tufte ER. Beautiful Evidence. Cheshire, CT: Graphics, 2006:13.
15. Rosenthal R, Rubin DB. A simple, general purpose display of magnitude of experimental effect. J Educ Psychol 1982;74:166–169.
16. Wainer H. Graphic Discovery. Princeton: Princeton University, 2005:63–67.
17. Wainer H. John Wilder Tukey: The father of twenty-first century graphical display. In: Graphic Discovery. Princeton: Princeton University, 2005:117–124.
18. Simonton DK. Career landmarks in science: Individual differences and interdisciplinary contrasts. Dev Psychol 1991;27:119–130.
19. Simonton DK. Creativity in Science. Cambridge: Cambridge University, 2004:69.
20. Wainer H. Graphic Discovery. Princeton: Princeton University, 2005:122–124.
21. Cleveland WS. The Elements of Graphing Data. Monterey, CA: Wadsworth, 1985:84.
22. Glantz SA. Primer of Biostatistics, ed 2. New York: McGraw-Hill, 1987:26–29.
23. Mainland D. Elementary Medical Statistics, ed 2. Philadelphia: Saunders, 1963:168–170.
24. Pearce SC. Biological Statistics: An Introduction. New York: McGraw-Hill, 1965:10.
25. Gilbert N. Biometrical Interpretation. Oxford: Clarendon, 1973:23.
26. Huth EJ. How to Write and Publish Papers in the Medical Sciences. Philadelphia: ISI, 1982:124–126.
27. Day RA. How to Write and Publish a Scientific Paper. Philadelphia: ISI, 1979:49.
28. Weisman HM. Technical Report Writing. Columbus, OH: Merrill, 1975:104.
29. Wainer H. Graphic Discovery. Princeton: Princeton University, 2005.
30. Lang TA, Secic M. How to Report Statistics in Medicine, ed 2. Philadelphia: American College of Physicians, 2006.
31. Mainland D. Elementary Medical Statistics, ed 2. Philadelphia: Saunders, 1963:170.
32. Hill AB, Hill ID. Bradford Hill's Principles of Medical Statistics, ed 12. London: Edward Arnold, 1991:53.
33. Monmonier M. How to Lie with Maps, ed 2. Chicago: University of Chicago, 1996:79.
34. Tufte ER. The Visual Display of Quantitative Information. Cheshire: Graphics, 1983:57, 161–169.
35. Tufte ER. Envisioning Information. Cheshire: Graphics, 1990.
36. Tufte ER. Visual Explanations: Images and Quantities, Evidence and Narrative. Cheshire: Graphics, 1997.
37. Ovenden M. London Underground by Design. London: Penguin, 2013:22–27.
38. Strunk W, White EB. The Elements of Style, ed 3. New York: MacMillan, 1979:26–28.
39. Simmonds D, Bragg G. Charts and Graphs. Lancaster: MTP, in association with the Institute of Medical and Biological Illustration, 1980.
40. Cleveland WS. The Elements of Graphing Data. Monterey, CA: Wadsworth Advanced Books and Software, 1985:229–294.
41. Simkin D, Hastie R. An information-processing analysis of graph perception. J Am Stat Assoc 1987;82:42.
42. Huff D. How to Lie with Statistics. New York: Norton, 1954:39–73.
43. Wainer H. How to display data badly. Am Stat 1984;38:137.
44. Zar JH. Biostatistical Analysis, ed 2. Englewood Cliffs, NJ: Prentice Hall, 1984:236–242.
45. Fraticelli R. "Any black crippled woman can!" or a feminist's notes from outside the sheltered workshop. A Room of One's Own 1983;8(2):10.
46. Wilkinson L. SYGRAPH: The System for Graphics. Evanston, IL: SYSTAT, 1989:54.


15 Diagnostic Tools and Testing
Ben Balevi, DDS, MSc
Donald Maxwell Brunette, PhD

"A smart mother often makes a better diagnosis than a poor doctor."
August Bier, German surgeon1

In memory of my sister Haya. —Ben Balevi

Clinical Practice

Patients seek a dentist to address an oral health issue that currently bothers them (ie, the chief complaint). The patient's objectives are to understand what their problem is (diagnosis), how bad it is (prognosis), and what can be done about it (treatment plan). However, the information gathered about a patient through a clinical examination does not necessarily determine the true state of the patient's health status but rather indicates a list of possible conditions (ie, differential diagnosis).2,3 Clinicians then go through an iterative exercise of "ruling in" or "ruling out" specific diagnoses from their list of differentials. This refinement of the list of possible options involves interpreting the patient's signs and symptoms against a backdrop of the clinician's current psychosocial and biologic knowledge base. This includes the scientific evidence compiled in findings from published clinical research articles, conferences, workshops, opinion leaders, and peers, as well as the clinician's own clinical experience. For example, novice clinicians sometimes tend to misdiagnose common conditions in favor of exotic ailments that may reflect the continuing education course they attended the previous weekend. These clinicians fail to heed the commonsense adage: "If it looks like a duck, quacks and waddles like a duck, then it probably is a duck!" Common things do occur commonly, and this phenomenon is the basis for the adage, "When you hear hoofbeats, think of horses and not zebras." Of course, the correct application of this adage depends upon whether the hoofbeats are heard on the plains of North America or on the plains of the Serengeti. These adages respectively illustrate two concepts: the principle of pattern recognition and the effect of prevalence. Both principles are important components of the diagnostic process.






Arriving at differential diagnoses and treatment plans requires an organized approach to the interpretation of the patient's symptoms and signs. Clinical symptoms are the patient's concerns (eg, pain, fatigue) from which they seek relief.4 Clinical signs are the clinician's objective observations (eg, bleeding on probing, sensitivity to percussion, periapical radiolucency, vital signs) that may be indicative of the patient's health status.5 One method used to facilitate the gathering and documentation of information and to generate diagnoses and treatment plans is the SOAP format. SOAP is an abbreviation for the following:

• Subjective information
• Objective information
• Assessment
• Plan

Subjective information is the information provided by the patient, ie, what the patient tells the clinician:

• Chief complaint
• Medical history
• Dental history
• Social/work/family history
• Symptoms
• Stated preferences for treatment, nontreatment, etc

Objective information is the observations and findings of the clinician:

• Patient's vital signs
• Clinical findings (signs)
• Radiographic findings
• Any diagnostic test results

Assessment is the clinician's interpretation of the subjective and objective information:

• Diagnosis via differential diagnosis
• Etiology (ie, what causes the problem)
• Prognosis (ie, the likely course of a disease or ailment without treatment)

The plan is the course of action or treatment/management proposed by and/or performed by the clinician, based on this assessment:

• Treatment options
• Documentation that the treatment options were presented and discussed with the patient

• Cost estimates
• Details of actual treatment delivered

Despite this standardized SOAP process, identifying the current nature of a patient's condition is a challenging aspect of clinical practice. The discovery of pathology in the symptomatic or asymptomatic patient requires systematic examination and critical assessment of the information as well as the cues given by the patient. Diagnosis is a task of trial and error performed in a world of imperfect information and uncertainty. The following illustrates the inexactness of clinical practice.

Dr Sally Smart has been in practice for just over a year but graduated second in her dental class from McGill University. Ralph is a 54-year-old man who walks into Dr Smart's dental clinic complaining of intermittent orofacial pain on his right side. He mentions that the pain is sometimes triggered by eating, especially biting on hard food, after which it sometimes aches. It often will subside by itself but sometimes gets worse in the morning when he wakes up. Sometimes the pain occurs without provocation. Ralph's medical history includes the use of a beta-blocker and an antidepressant. From this initial encounter, Dr Smart must first try to distinguish whether the etiology of his symptoms is odontogenic. Such uncertainty is a fact of all aspects of decision making due to the imperfect or incomplete information often available early in the interactions with patients. For example, Dr Smart must determine the degree of truthfulness of Ralph's statements. This doubt is not necessarily a reflection of the patient's integrity, but more often speaks to the accuracy of his story. For example, Ralph remembers the onset of symptoms a week ago, but maybe he had a similar episode a few months ago that he ignored because it went away on its own. Also, Ralph may not remember any recent traumatic event he can associate with his symptoms, but Dr Smart's persistent questioning reminded him of a recent event when his grandchild accidentally "whacked" his jaw. Furthermore, Dr Smart noticed Ralph is a little out of breath, which he claims is from walking up three flights of stairs to get to her office. Although Dr Smart's natural inclination is to make an odontogenic diagnosis, she must first rule out any nonodontogenic etiology, for example, referred pain from an ischemic myocardial infarction (MI) or angina. Ralph has already disclosed being on antihypertensive medication (a beta-blocker). Dr Smart digs further into the patient's history to rule out angina or possible MI. Dr


Smart takes Ralph's blood pressure and finds it to be high (160/100), with an erratic pulse as a possible red flag. Although such findings are not diagnostic and the likelihood of an MI is low, its consequences are significant. Dr Smart immediately refers Ralph to Dr Fredrick Fleming, a family doctor who is located just across the hall. The medical clinic triages Ralph directly to the attention of Dr Fleming. After undergoing a battery of tests including an electrocardiogram (EKG) and blood test (ie, troponin levels < 0.2 ng/mL), Dr Fleming is confident Ralph is not experiencing a cardiac event. Granted, the physician is also relying on imperfect information, and there is a risk of a false negative with the EKG machine and blood test, so without doing a stress test or angiogram he cannot be absolutely sure of his diagnosis. Nevertheless, Dr Fleming is certain enough that no further tests are necessary and the probability of an MI is so low that any further test cannot be justified on a cost-benefit and risk-of-harm analysis. At the same time, the physician does a comprehensive physical examination followed by a routine blood test to rule out any other nonodontogenic explanation for Ralph's symptoms. Ralph returns to Dr Smart's office immediately after finishing with the physician on the off chance that he can still be seen that day. As luck would have it, Dr Smart has a student loan payment due and is more than happy to work late. Now that Dr Smart is comfortable that Ralph is not undergoing a cardiac event, she can do her physical examination of Ralph's extraoral and intraoral tissues. Dr Smart finds normal-looking soft tissue. There is mild discomfort on palpation of Ralph's masseter muscles, with a slight jaw movement to the left on opening. Also, Ralph presents with generally good periodontal status: periodontal pocketing in the 2- to 4-mm range, minimal bleeding on probing (BOP), but 3 mm of gingival recession on the maxillary posterior teeth. Teeth nos. 14 and 16 are nonrestored. Tooth no. 15 has a disto-occlusal composite resin restoration, and no. 17 has a full-cast crown. No obvious caries lesion is found visually or with the assistance of an explorer. Ralph gives a vague response to the bite stick test on teeth nos. 15 and 16, and a positive test is not repeatable. Furthermore, Ralph's teeth are not sensitive to percussion. The electric pulp test (EPT) gives a nonrepeatable positive response on teeth nos. 15 and 16 that is inconclusive to Dr Smart. Radiographs (ie, bitewing and periapical) show that tooth no. 15 has a disto-occlusal composite resin restoration with no significant radiolucency around it and

what appears to be 2 mm of healthy dentin tooth structure between the restoration and the pulp chamber. Tooth no. 16 appears radiographically sound, with no significant radiolucency suggestive of interproximal caries. Tooth no. 17 has no history of endodontic therapy, and the full gold crown obliterates radiologic detection of a pulp chamber at risk. On further query, Ralph cannot even remember that he has a crown on this tooth, but when prompted he says, "Oh yeah. That was done by Dr Chehroudi a long time ago. I think he retired. He was such a good dentist." After further assessment, Dr Smart informs Ralph that she can't be sure whether his symptoms arise from a self-limiting dental sensitivity due to root exposure or from a cracked tooth and/or irreversible pulpitis, most likely originating from teeth nos. 15 and 17 and unlikely to come from the nonrestored teeth nos. 14 and 16. Furthermore, Dr Smart tells Ralph, he may simply have self-limiting temporomandibular disorder (TMD) pain as a result of being whacked by his spoiled grandchild. Ralph is not happy with the lack of a definite diagnosis from Dr Smart. He decides to get a second opinion from a dentist in a medical building across the street. Dr Bill Brainiac is also working late because he has a second mortgage payment due and can accommodate Ralph on short notice. Dr Brainiac does the same battery of tests that Dr Smart did less than an hour ago but reports some slightly different readings, specifically that tooth no. 17 is slightly sensitive to percussion on repeated tries and gives a hypersensitive response to the vitality testing. Dr Brainiac informs Ralph, with the characteristic confidence of an accomplished University of British Columbia graduate with over 10 years of clinical experience, that his symptoms are a result of an irreversible pulpitis with early signs of periapical periodontitis on tooth no. 17. He suggests root canal therapy on this tooth. Ralph has received two very different approaches to managing his symptoms. Dr Smart is suggesting that they continue to monitor the symptoms until they become localized, while Dr Brainiac is recommending root canal treatment to be done promptly. Now that Ralph has received a definite diagnosis from Dr Brainiac with a proposed treatment, he must decide which plan to follow: (1) monitor, with the possibility that the symptoms will turn out to be self-limiting and related to TMD or will eventually be localized to the offending tooth, but with the risk of developing a severe toothache in the interim, as suggested by Dr Smart; or (2) go with Dr Brainiac's suggestion of undergoing costly and invasive root canal therapy on tooth no. 17, with the risk of a misdiagnosis resulting in an unnecessary procedure.


After a long hard day of clinical practice, Dr Smart and Dr Brainiac sometimes go to the local watering hole for happy hour drinks. Over a beer they often share their clinical cases, and they realize they both examined Ralph today. Bill tells Sally of his highly certain diagnosis on tooth no. 17 based on the percussion test and the EPT. This leaves Sally wondering how it was possible that Bill was apparently able to establish a confirmatory result from these tests while her readings were vague and inconclusive. Furthermore, Sally asks Bill why he did not consider TMD, to which he responds, "Because the patient never disclosed that he was recently whacked."

Principle of Diagnosis

Fundamental to successful patient care is correctly identifying the patient's problem. In other words, before a dentist can offer any resolution to a patient's problem, the problem must be correctly understood and described by the clinician. In clinical practice the "patient's problem" is termed the diagnosis, and its management to resolution is through treatment. The Oxford Dictionary defines diagnosis and treatment as follows:

• Diagnosis is identifying the nature of an illness or other problem by examination of the signs and/or symptoms.6
• Treatment is the management and care of the patient for an illness or injury.7

The subjective and objective information, as reflected in the patient's symptoms and the clinically observed signs, may collectively be considered diagnostic tools, because both information sources are used to generate a diagnosis.8 The fundamental purpose of a diagnostic test is to either "rule in" or "rule out" the presence of a particular disease or condition, so that individuals with the condition can be distinguished from those without the condition. This diagnostic process demands the application of exact criteria to define the condition that is the target of the process. In general, clinicians, researchers, and patients agree that the presence of disease indicates a change in anatomy, physiology, biochemistry, or psychology, but they are less likely to agree on the exact criteria that define a particular condition.9 Two major principles of disease are described by Wulff10: (1) the nominalistic or patient-oriented

principle, and (2) the essentialistic principle, which emphasizes disease as an independent entity. In the patient-oriented nominalistic approach, disease classification concerns the classification of sick patients. Disease is not considered to exist as an independent entity, and a particular disease is defined by the group of characteristics that occur more often in “patients” with the disease than in other people. Patients demonstrate a pattern of similar symptoms and signs, as well as similar prognoses and responses to treatments. Moreover, the nominalistic approach does not require a definition of “normalcy” and acknowledges that definitions of disease may vary between different societies.10 The essentialistic view10 is closely related to a more contemporary principle of disease termed biochemical fundamentalism,11 which is based on biochemistry and molecular biology. Once the underlying biochemical events are understood, the course of a disease can be predicted theoretically, because diseases are assumed to follow regular patterns. Definition of the “normal” state is avoided; biotechnology and statistical concepts define the disease state in terms of both the distribution of specified features in a specific population and the extent to which that distribution differs from a similar assessment of a group the investigators consider “not diseased.”10,11 In fact, this statistical approach forms the basis for utilizing biomarkers as diagnostic or screening tests. Irrespective of the approach, essential prerequisites in arriving at a diagnosis should ideally include a clear, unambiguous definition and an understanding of the natural history of the particular disease being investigated. Definitions of a specific disease may change over time, and specified criteria may be limited to specified populations at specified time periods. Testing protocols/conditions for the disease may also be specified. These concepts are illustrated below using the example of diagnosis of hypertension. Test data may be divided into different categories of “disease.” For ease of discussion, this chapter focuses primarily on dichotomous data divided arbitrarily into two mutually exclusive categories: positive or negative. The results of a pregnancy test are either positive or negative; it is not possible to be “a little bit pregnant.” In contrast, some tests, such as blood pressure (BP) measurements for assessment of hypertension, may indicate several levels of abnormality (see below). In some instances, precise definitions can be crucial in making literal life-and-death decisions, such as decisions about tissue antigen matching in organ donation.


Box 15-1 | Principles of diagnostic decision analysis

Principle 1: Clinicians should not consider patients as absolutely having a disease, but rather as having only the probability of disease. The probability of disease is based on the prevalence of the disease, the patient's history (including risk factors, symptoms, signs, and previous test results), and the clinician's previous experience with similar situations.

Principle 2: Clinicians use diagnostic tests to improve their estimate of the probability of disease, and the estimate following the test may be lower or higher than the probability before the test. Tests should be selected by their ability or power to revise the initial probability of disease.

Principle 3: The probability that disease is actually present, following a positive or negative test result, should be calculated before the test is performed. Application of this principle results in fewer useless tests being performed.

Principle 4: A diagnostic test should revise the initial probability of disease. However, if the revision in the probability of the disease does not alter the planned course of management/treatment, then the use of the test should be reexamined. Unless the test provides information desired for an unrelated problem, there is no sense in performing tests that will not alter the planned course of management/treatment.

The Process of Coming to a Diagnosis

Both clinicians and patients expect that the information gleaned from a clinical examination is truthful, reliable, and useful for providing a diagnosis that will direct a subsequent course of management. But as shown in the case of Ralph earlier, such information can be prone to error and variability. Interpreting the results of diagnostic tools in order to arrive at a diagnosis should be guided by four principles of decision analysis (Box 15-1).12–14 Principle 1 states that in the context of making a diagnosis, patients have a probability or risk of disease; they do not have a disease. The initial probability or risk of a particular disease being present, referred to as the pretest probability, is equivalent to the prevalence of that particular disease in a specified population at a specified period of time. This initial probability (prevalence) may then be revised upward or downward, depending on the patient's symptoms, signs from the examination, and previous test results, as well as the clinician's previous clinical experience with similar situations. The probability or risk of disease is increased if the patient has one or more risk factors for the disease (principle 2). Hence, the patient is assigned a post–patient interview probability of having the disease. After completing the patient history and clinical examination, the clinician may be confident that a particular disease is present. Therefore, as per principle 3, further investigation or testing for diagnostic purposes is not required, and the appropriate treatment should proceed without delay. In similar fashion,

if the clinician is confident that a particular disease is not present, then further testing is not warranted. However, the clinician may remain undecided as to the presence or absence of a particular disease. The clinician once again applies principle 2, and diagnostic tests may now be considered to revise the post–patient interview probability, either upward or downward. However, measurements, assays, and/or diagnostic test results cannot confer 100% certainty as to the presence (ie, positive result) or absence (ie, negative result) of disease, as there is a risk of false-positive or false-negative results. Hence, test results (either positive or negative) can only revise (upward or downward) the pretest probability. Once a test result is obtained, both clinician and patient must accept and deal with the result and ignore their own biases. For example, a dental student might interpret a result so that it gives him/her the opportunity to perform a procedure required by the program for graduation. A patient may hope that an expensive treatment is not required. Either way, "cherry-picking" the desired result would violate the principle of objective testing. To make decisions objective and appropriately apply principle 4, a threshold approach is often used (Fig 15-1):

1. For each particular disease or condition, the clinician sets a threshold for testing, known as the test threshold.
2. The clinician sets a second threshold for treatment, known as the test-treatment threshold.

These thresholds represent cutoff probabilities for ruling in or ruling out a particular disease.


Fig 15-1 Threshold approach to decision analysis and consequent actions. In zone 1 (pretest probability below the test threshold, near 0%), disease is highly likely absent: do not test, do not treat. In zone 2 (between the test threshold and the test-treatment threshold, roughly 30% to 70%), disease may or may not be present: proceed with appropriate tests, and base treatment on the test results. In zone 3 (above the test-treatment threshold, approaching 100%), disease is most likely present: do not test, proceed with treatment.

The threshold values depend on the disease and the subsequent consequences/course of management related to either ruling in or ruling out the disease. False-positive and false-negative results have consequences that must be weighed in each individual's case. The test should not be performed if it is not powerful enough to revise the pretest probability to the point where either a positive or a negative result would alter the planned course of action.

3. When the pretest probability falls between the test threshold and the test-treatment threshold, testing is indicated, and treatment should proceed on the basis of the test results.

In general, diagnostic tests are most useful when the pretest probability falls roughly between 30% and 70%.15–17 No diagnosis is absolute. Diagnoses are merely probabilistic. The process of coming to a diagnosis involves performing a series of tests, where after each test clinicians ask themselves, "Do I do something, nothing, or get more information?"18 This iterative process continues until the clinician feels that s/he has reached a level of confidence (probabilistic threshold) that no further testing at this time will change the decision to either do something (treatment) or nothing (principle 4). Clinicians may decide to do nothing because they feel the disease is self-limiting and will get better on its own, or because the test results are vague and inconclusive (eg, unlocalized dental sensitivity or cracked tooth), so that time is needed for the disease to progress further before a diagnostic test (percussion, bite stick test) is sensitive enough to identify and localize the problem (Fig 15-2).
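A minimal sketch of this threshold logic follows; it is our own illustration, not the authors' algorithm, and the 30%/70% defaults simply echo the rough range quoted above (real thresholds depend on the disease and the consequences of error).

# Classify a pretest probability into the three zones of Fig 15-1.
def threshold_decision(pretest_p: float,
                       test_threshold: float = 0.30,
                       test_treatment_threshold: float = 0.70) -> str:
    """Return the action implied by the threshold approach."""
    if pretest_p < test_threshold:
        return "Zone 1: do not test, do not treat"
    if pretest_p < test_treatment_threshold:
        return "Zone 2: test, and treat based on the result"
    return "Zone 3: do not test, proceed with treatment"

for p in (0.05, 0.50, 0.90):
    print(f"pretest p = {p:.2f} -> {threshold_decision(p)}")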

The threshold approach to decision analysis is illustrated by the following examples of hypothetical patients A, B, and C for the disease of pulpal pathology and the test of periapical radiograph. These patients fall into one of three zones of decision making. In zone 1 (see Fig 15-1), the pretest probability of disease is below the test threshold. Patient A describes the sudden onset of pain to cold, sweet, or sour foods and beverages. These symptoms involve the maxillary anterior teeth, and patient A expresses concern about the need for root canal treatments. Patient A denies a history of trauma but reports the occasional use of an orthodontic retainer and the recent application of at-home tooth-whitening products daily over the past week. Examination reveals an unrestored dentition with localized facial cervical recession. At this time, root sensitivity caused by recent application of bleaching products to exposed dentin is the most probable diagnosis. Pulpal pathology is most likely absent. Information obtained from a periapical radiograph would alter neither the diagnosis nor the course of management (topical application of desensitizing agents). Even a positive test result (eg, widened periodontal ligament space around the root apex) would not alter the posttest probability to a level that would justify endodontic treatment. Hence, neither endodontic treatment nor further testing should proceed. In zone 2 (see Fig 15-1), patient B presents with complaints of intermittent pain and gingival swelling along the facial aspect of the mandibular right central incisor. Examination reveals a well-maintained dentition and a fistula with obliteration of the mucobuccal fold adjacent to the incisor, which has a large cervical restoration. Periodontal probing around the incisor ranges from 2 to 6 mm, but it is not possible to probe the sulcus/pockets and communicate directly with the fistula. Patient B reports pain to percussion and delayed and mild sensation to cold stimulation. A necrotic pulp and/or periodontal abscess may be present.


Fig 15-2 Clinical diagnosis decision algorithm (Sn, sensitivity; Sp, specificity). Starting from the chief complaint, test first for the most likely diagnosis: set the pretest probability (starting with prevalence), apply the diagnostic test (Sn, Sp), and calculate the positive predictive value (PPV) and negative predictive value (NPV). If the PPV exceeds the diagnostic threshold, either treat (if treatment is likely to make the patient better off) or allow the disease to take its natural course; otherwise, if the NPV is high, test for the next most likely diagnosis, using a new pretest probability.

A radiograph with a gutta percha point inserted into the fistula is warranted, because it may provide useful information for diagnosis and further management. In zone 3 (see Fig 15-1), patient C reports no current pain involving the mandibular left first molar but also

mentions a history of a “toothache” that stopped after the tooth “broke.” Examination reveals that the molar has fractured lingual cusps, with visible gross caries involving the pulp chamber. Patient C denies any pain resulting from cold stimulation. Caries, tooth fracture,


and necrotic pulp are diagnosed without the need for radiographs. However, a radiograph is required to guide prognosis and further treatment, either extraction or endodontic therapy.

Diagnosing, Screening, Surveillance

Thus far the discussion has been limited to the use of diagnostic tests to identify what is manifesting the patient's chief complaint. However, in clinical practice, the dentist may perform a diagnostic test for the early diagnosis of a disease in the asymptomatic patient or to monitor the status of a patient already diagnosed with a disease. To return to Ralph, he decided not to take Dr Brainiac's advice and instead sought a third opinion from another dentist, Dr Goody Hands, a University of Toronto graduate who failed oral physiology twice and barely passed neuroanatomy but won the dean's prize for demonstrating the best manual dexterity during her 4 years of dentistry. Not wishing to bias Dr Hands's diagnosis by describing his discussions with Dr Brainiac and Dr Smart, Ralph describes his symptoms only as sensitive teeth triggered by sweets and breathing in cold air. Dr Hands suspects the symptoms are the result of a caries lesion, and she performs a diagnostic test (ie, bitewing radiograph) to confirm or rule out her suspicion. An example of using a diagnostic test to screen for a targeted oral disease before symptoms occur is illustrated by Sandra, a new patient in Dr Hands's practice. Sandra had not seen a dentist for several years and would like "a check-up" and her teeth cleaned. Although Sandra does not complain of any symptom (no chief complaint), Dr Hands is concerned that she may be at risk of dental caries because of the high prevalence of caries in the community, which has non-fluoridated water, and the length of time that has passed since Sandra had her oral cavity checked by an oral health professional. If a significant interproximal lesion is identified with a bitewing radiograph, then Dr Hands will likely excavate and restore the lesion before it becomes an endodontic problem. In this case, the bitewing radiograph is used to screen for disease that may be developing. Screening serves the purpose of identifying pathology early for better prognosis, minimizing patient morbidity, and hopefully reducing complexity of treatment, resulting in lower cost to the patient. Dr Hands's next patient of the day is Eric. Eric is a long-term adult patient of the dental practice who is fastidious about regular professional maintenance by

a hygienist. A bitewing radiograph taken a year ago showed an incipient interproximal shadow on the mesial surface of tooth no. 47 that had yet to penetrate the dentinoenamel junction. Rather than excavate and restore the lesion, Dr Hands recommended that Eric rinse daily with 0.05% sodium fluoride, in addition to reducing the frequency of refined carbohydrates in his diet, with the hope of at least preventing the further spread of the lesion into the tooth structure and at best reversing the lesion via remineralization.19 During this appointment, updated bitewing radiographs will be taken to assess the current status of the lesion. The purpose of the radiograph is neither diagnosis nor screening for any new lesion but rather monitoring of the status of a known existing lesion (ie, surveillance). These three clinical scenarios involve employing the same diagnostic tool, the bitewing radiograph, for three different reasons: diagnosis (for Ralph), screening (for Sandra), and surveillance (for Eric). The interpretation of the outcome of the diagnostic tool differs because the likelihood of disease before the test differs for each case. In the first case, Ralph comes in with the classical symptoms of pulpitis suggestive of caries; therefore, the probability that he has a significant caries lesion needing immediate treatment is much higher than for asymptomatic Eric, who is being monitored for a known incipient lesion (surveillance), and higher still than for the asymptomatic new patient Sandra, who is being screened for any existing lesions.

The Diagnostic Tool Versus the Gold Standard Diagnostic Tool

A diagnostic tool is any device or action used to assess the current health status of a patient. For example, a clinician simply querying a patient about nocturnal paroxysmal pain is a diagnostic tool used to assess the possibility that the patient's complaint is a result of an irreversible pulpitis.20 Similarly, a patient admitting to smoking is a screening test for the possibility of oral cancer.21 The word "possibility" is used to stress that a patient's affirmative response does not establish with absolute certainty an endodontic event or oral cancer but may guide the clinician to conduct further diagnostic or screening tests, possibly leading toward root canal therapy or a referral for a biopsy, respectively.


Fig 15-3 The diagnostic universe. (a) The population is made up of people who have the targeted disease (green circle) among people who are free of the disease (no disease). (b) A diagnostic test performed on the population will test positive (light green circle) for some of the people. The intersection of the light green and green circles indicates the diseased people who were correctly identified by the diagnostic test (True POS). The False POS are those disease-free people who incorrectly tested positive. The False NEG are those people with the targeted disease who were not picked up by the diagnostic test. Disease-free people who correctly tested negative are referred to as True NEG. (c) A poor diagnostic test, reflected by its very small proportion of True POS and its large proportions of False POS and False NEG. (d) A better diagnostic test, because it identifies a large majority of the people with the target disease with a small proportion of False POS and a small proportion of False NEG. (e) Gold standard test: errorlessly identifies everyone in the population with the targeted disease.

In theory, the true nature of a disease can only be determined via a theoretical flawless gold standard diagnostic tool. Although in reality no such perfect tool exists, there are certain tools (eg, a histologic investigation) that are universally agreed to have such a low risk of error that the outcomes of all other diagnostic tools are compared against them. It is understood that the diagnostic test comes with its own risks of incorrectly testing positive when the disease does not exist (ie, a false positive) or incorrectly testing negative when the disease truly does exist (ie, a false negative). This can be illustrated through a series of Venn diagrams.3 There are two groups (Fig 15-3): (1) those people who have the disease and (2) those people who do not (ie, healthy people). Once the test is given, there are various possible subsets. There are the subsets of healthy people who correctly test negative and the people with the disease who correctly test positive, represented as the intersection between the set of people who have the disease and the set of people who test positive. Then there are the subsets of healthy people who incorrectly test positive (false positive) and the subset of people with the disease who incorrectly test negative (false negative).
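These subsets can be made concrete with counts. The sketch below uses hypothetical results of a test applied to 1,000 people (100 of whom have the disease) and computes the test properties that are defined formally later in this chapter (see Table 15-3); the counts are ours, not from any study.

# The diagnostic universe of Fig 15-3 as counts (hypothetical data).
true_pos, false_neg = 90, 10     # the 100 people with the disease
false_pos, true_neg = 45, 855    # the 900 disease-free people

sensitivity = true_pos / (true_pos + false_neg)   # p[+/D]  = 0.90
specificity = true_neg / (true_neg + false_pos)   # p[-/ND] = 0.95
ppv = true_pos / (true_pos + false_pos)           # p[D/+]  ~ 0.667
npv = true_neg / (true_neg + false_neg)           # p[ND/-] ~ 0.988
print(sensitivity, specificity, ppv, npv)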

Given the variability in diagnostic value of various tests, the clinical interpretation of test results is a challenge of probability assessment for every clinician, because everyone is practicing in a world of uncertainty. Measuring probability is the process of counting events. Specifically, it is the quotient of the number of times a specific event occurs divided by the number of times all possible events of the same type can occur. Probability takes on a number between 0 and 1 or a percentage between 0% and 100%. A probability of 0 or 0% indicates with absolute certainty that the event will never happen, while the absolute certainty that an event will occur has a probability of 1 or 100%. The mathematical notation of the probability of event A occurring is given as p[A]. For example, the prevalence of a disease is a measure of the probability of the disease in a population. The prevalence of untreated caries in an American adult has been reported as 26% and is depicted mathematically as p[caries] = 26% (or 0.26).22 Therefore, in a group of 100 people from anywhere in the United States who randomly walk into a clinic today, about 26 will likely have an untreated caries lesion.


Table 15-1 | Census data

Population of the US: 316,497,531 (ref 25)
Number of females in the US [♀]: 160,756,163 (ref 25)
Number of hygienists in the US [HYG]: 174,100 (ref 26)
Number of female hygienists in the US [♀ HYG]: 166,788 (ref 27)

Table 15-2 | Probability calculation of females and female dental hygienists in the United States

Females in the US: p[♀] = 160,756,163 / 316,497,531 = 0.508
Hygienists in the US: p[HYG] = 174,100 / 316,497,531 = 0.000550
Females given hygienists: p[♀/HYG] = 166,788 / 174,100 = 0.958
Hygienists given females: p[HYG/♀] = 166,788 / 160,756,163 = 0.00104

Some probabilities are given with conditions associated with them. For example, the prevalence of severe periodontitis in children is a conditional probability: what is the probability that the next child who walks into an office from anywhere in the United States will have advanced periodontitis? Although the prevalence of moderate to advanced periodontitis in adults is reported to be 5.08%,23 in children it is less than one-tenth as prevalent.24 Conditional probability notation is given as the following:

• p[A/B] = probability of event A occurring given that event B is certain
• p[B/A] = probability of event B occurring given that event A is certain

Although we all agree that p[A/B] and p[B/A] are different, it is important to understand why and how they are different. For example, imagine that you are in an empty room and know that in the next 5 minutes, someone from anywhere in the United States will walk through the door. What is the probability that this person will be a woman? The probability is the quotient of the number of all women in the United States (numerator) divided by the total population of the United States (denominator). Using the data from Tables 15-1 and 15-2, the probability that a woman will walk into the room in the next 5 minutes is p[♀] = 0.508 (ie, 50.8%).25 The probability that the person will be a hygienist (HYG) is p[HYG] = 0.0005 (or 5 hygienists in every 10,000 people in the United States).26


Now let us change the conditions (ie, the numerator and denominator) and imagine that you know with certainty that the next person who walks through the door is a dental hygienist. The probability that this person will be a woman is now p[♀/HYG] = 0.958.27 Note that by changing the condition of the denominator from all people in the United States to only all hygienists in the United States, the probability changes significantly, from about 51% to 96%. Now let us invert the condition to one where you know with certainty that a woman will walk into the room. The probability that she will be a hygienist is p[HYG/♀] = 0.001 (ie, 1 in 1,000 women in the United States is a hygienist). Note how, by reversing the condition between hygienist and woman, the probabilities p[♀/HYG] and p[HYG/♀] differ dramatically. It is important to appreciate this concept as the discussion moves to the diagnostic tool's properties of sensitivity and specificity, pretest and posttest probabilities, and a screening test's predictive values. See Table 15-3 for the summary of probability terms and concepts used in screening and diagnosis.
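The Table 15-2 arithmetic can be verified in a few lines; the sketch below (variable names are ours) reproduces the four probabilities from the census counts.

# Conditional probabilities from the Table 15-1 census data.
us_population = 316_497_531
females = 160_756_163
hygienists = 174_100
female_hygienists = 166_788

p_female = females / us_population                    # p[F]      ~ 0.508
p_hyg = hygienists / us_population                    # p[HYG]    ~ 0.000550
p_female_given_hyg = female_hygienists / hygienists   # p[F/HYG]  ~ 0.958
p_hyg_given_female = female_hygienists / females      # p[HYG/F]  ~ 0.00104
print(p_female, p_hyg, p_female_given_hyg, p_hyg_given_female)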



Table 15-3 | Diagnostic terms: Definitions and mathematical notations

Pretest probability (ppretest): Probability of the individual having the disease (D) before administering the screening test. Notation: ppretest[D]

Nosographic rates
Sensitivity (Sn): Probability of testing positive (+) for individuals known to have the targeted disease. Notation: p[+/D]
Specificity (Sp): Probability of testing negative (–) for individuals known to not have the targeted disease (ND). Notation: p[–/ND]

Diagnostic rates (posttest probability)
Positive predictive value (PPV): Probability of having the targeted disease for individuals known to test positive. Notation: p[D/+]
Negative predictive value (NPV): Probability of not having the targeted disease for individuals known to test negative. Notation: p[ND/–]

Reliability of Measurements

As noted previously, patients and clinicians have the reasonable expectation that measurements are reliable. But as we saw in Ralph's case, intra-observer agreement (the same dentist, eg, Dr Sally Smart, obtaining a consistent reading from a diagnostic test) and inter-observer agreement (two independent dentists, like Dr Smart and Dr Brainiac, obtaining consistent readings from the same test) do not always occur. Reliability refers to the reproducibility or ability to obtain the same measurement consistently over sequential measures. The most direct way to determine reliability is to measure the same thing with the same device several times and compare the results. Note that in our example of measuring BP, the "classification is based on the average of two or more properly measured, seated BP readings on each of two or more office visits."28 A measure is reliable when the variation or random fluctuation due to errors in measurement is small. Measurements are not reliable when there are extraneous factors that may be unknown or difficult to control. Reliability of a measurement may be affected by three sources of variability: (1) the system or phenomenon being measured; (2) the examination itself, such as the examination environment, equipment, or instruments; or (3) the examiners.

Reliability versus validity

Reliability and validity do not refer to the same concept, but they are related. One way of thinking about reliability and validity is to ask whether the measurement hits or is close to the true value, represented as the bull's-eye on a target. A valid measure is one that produces values close to the bull's-eye. A measurement that is swamped by systematic error (see chapter 13) is not valid; it consistently gives the wrong value. A

reliable measure is one that produces values that are close together, as in test-retest procedures. However, it should be noted that the situation modeled in the "valid but not reliable" diagram, in which the values are dispersed over the whole target, is not useful for making inferences; the more general rule is that for a measurement to be valid it must be reliable. Another way of illustrating the relationship between reliability and validity is given in Fig 15-4, which employs a dental example. Consider a student's performance in examinations. There is probably some true score that accurately indicates a student's knowledge of the course's content, but the actual grade received depends on the following:

• Variations in performance during the test period (eg, some students might get tired during the test)
• Variations in performance from day to day (eg, the student might be sick or hungover)
• Variations in the conditions under which the examination is administered (eg, time of day, room ventilation)
• Sample of test items used (eg, all the questions could be on topics the student has not studied)

These factors lead to the examination grade not reflecting the student's knowledge and can be considered sources of error in the estimation of the student's knowledge. The larger these factors are, the less reliable the examination grade is as a measure. Similarly, in dentistry, test results may vary because of variation in the system or phenomenon being measured.
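The bull's-eye model can be mimicked with a small simulation, in which systematic error (bias) pulls readings away from the true value and random error scatters them; the scenario labels and numbers below are illustrative only.

# Simulating the four validity/reliability combinations of Fig 15-4.
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0  # eg, a true root length in mm

scenarios = {
    "valid and reliable":          dict(bias=0.0, sd=0.1),
    "not valid but reliable":      dict(bias=2.0, sd=0.1),
    "unbiased but unreliable":     dict(bias=0.0, sd=2.0),
    "neither valid nor reliable":  dict(bias=2.0, sd=2.0),
}

for name, s in scenarios.items():
    readings = true_value + s["bias"] + rng.normal(0, s["sd"], size=20)
    print(f"{name:28s} mean={readings.mean():6.2f}  SD={readings.std(ddof=1):5.2f}")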


Fig 15-4 The complex relationship between validity and reliability. (a) High validity and high reliability: eg, a calibrated digital ruler used by calibrated, trained examiners to measure root length of an incisor on a periapical radiograph obtained using the parallel technique. (b) Low validity and high reliability: eg, a calibrated digital ruler used by calibrated, trained examiners to measure the root length of an incisor on a periapical radiograph obtained using the bisecting angle technique. (c) High validity and low reliability: eg, a calibrated digital ruler used by calibrated, trained examiners to measure dimensions of a block of dry ice at room temperature. (d) Low validity and low reliability: eg, a group of untrained examiners use their fingers' widths to determine a patient's maximal interincisal opening over repeated jaw openings.

Variation in the system or phenomenon being measured

The phenomenon being measured may exhibit normal biologic variability. For example, throughout the day and under different circumstances (eg, body position, stress, and exercise), BP and pulse rate fluctuate; similarly, diurnal and menstrual cycles cause hormonal fluctuations. Moreover, the very act of measurement may influence or alter the phenomenon being measured so that repeated measurements (test and retest) are not reproducible (not reliable). When patients are asked to open their mouth as wide as possible, they may not be able to do so at first attempt. After several attempts, the interincisal distance may increase as a result of stretching or decrease because of fatigue. The act of repeated mandibular movements may also affect other clinical variables for assessment of TMDs, such as tenderness to muscle palpation and temporomandibular joint (TMJ) sounds, so that findings may not be stable in either the short or the long term.29 Some phenomena, such as BP, will demonstrate "regression toward the mean" by returning to usual levels over time.30 Therefore, some phenomena require repeated evaluations over time before a diagnosis is finalized. In fact, the inherent variability of physical attributes associated with many dental conditions is responsible for lowering the reliability of some diagnostic tests.

Fig 15-4 The complex relationship between validity and reliability. (a) High validity and high reliability: eg, a calibrated digital ruler used by calibrated, trained examiners to measure root length of an incisor on a periapical radiograph obtained using the parallel technique. (b) Low validity and high reliability: eg, a calibrated digital ruler used by calibrated, trained examiners to measure the root length of an incisor on a periapical radiograph obtained using the bisecting angle technique. (c) High validity and low reliability: eg, a calibrated digital ruler used by calibrated, trained examiners to measure dimensions of a block of dry ice at room temperature. (d) Low validity and low reliability: eg, a group of untrained examiners use their fingers’ widths to determine a patient’s maximal interincisal opening over repeated jaw openings.

Variability from examination equipment and environment

Laboratories often assess the precision of their methods by running repeated determinations on the same sample. If the method destroys the sample, the sample can be split, and measurements can be made on the subsamples. For example, the mass NB10 at the American National Bureau of Standards is supposed to weigh exactly 10 g. Eleven determinations of the weight of NB10 gave a value of 9.9995982 g, with 95% confidence limits of ± 0.0000023 g.31 We can see that the reliability of weight determination is very high. This is expected, because elaborate instrumentation for weighing objects has been developed, and the observations are performed under well-specified conditions. Typically, the results are expressed as a standard deviation (SD) of the individual values or as a confidence interval around the calculated mean.

The incorrect function or use of measuring devices or instruments may also be a source of variability. Reliable recording of BP depends not only on a calibrated sphygmomanometer but also on correct positioning of an appropriate-sized cuff on the arm and proper inflation of the cuff.

Reporting variability

Coefficient of variation

In some instances, it is of interest to express the variability of the measurement relative to the size of the quantity being measured. In such cases, the coefficient of variation (CV) is used, which is defined as:

CV = s / X̄

It is often expressed as a percentage, ie:

CV = (s / X̄) × 100%

where s = SD of the measurements, and X̄ = mean of the replicate measurements.
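To make the arithmetic concrete, the following is a minimal Python sketch (our illustration, not from the original text); the replicate probing-depth values are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Return the CV, as a percentage, of a set of replicate measurements."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, the usual choice for replicates
    return sd / mean * 100

# Hypothetical replicate probing-depth measurements (mm) on the same site.
# Note that CV is meaningful only for ratio-scale data.
replicates = [4.0, 4.5, 4.2, 4.1, 4.3]
print(f"CV = {coefficient_of_variation(replicates):.1f}%")  # ~4.6%
```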


We can see from the formula that CV is a measure of the relative variability of a sample. Because s and X̄ have the same units, the ratio (ie, CV) has no units at all. The CV may only be calculated for ratio-scale data.

Precision

It is important to distinguish the precision of a measurement from the reliability of the measurement. The precision of a measurement refers to the exactness or degree of refinement with which a measurement is stated. Clinicians may measure the anatomical root length on a radiograph to the nearest half millimeter with a Boley gauge. This measurement could be made electronically, using tools with more precision, to the nearest hundredth of a millimeter. However, higher precision would likely not translate into higher reliability scores when the biologic variation is large, and it may not be clinically relevant or necessary. Precision, by itself, does not confer reliability.

Variability of the examiner(s)

Biologic variation among examiners results in variable acuity of their senses (eg, sight, touch, hearing), which may be further affected by their mood and sleep status. Examiners may be inexperienced or incompetent. They may replace evidence with inference, which can close a clinician's mind to other diagnoses.3 Consider a patient who describes difficulty in opening her mouth and swelling and pain along the right side of her face, which began several hours after a forceps extraction of an erupted carious maxillary right third molar. The dentist suspects an infection and prescribes powerful antibiotics. Unfortunately, the symptoms do not resolve, and, after 3 days, the patient reports bruising along the right side of her face and persistent discomfort on opening her mouth. A hematoma had developed from the injection of local anesthetic, and wide opening during extraction had resulted in trismus of the masticatory muscles. In this example, the clinician jumped to the conclusion that the initial symptoms were caused by bacterial infection and failed to consider the not-uncommon alternatives of hematoma and TMD.

Clinicians also tend to diagnose what they expect or hope to find (ie, confirmation bias)9; hence, a clinician's mindset may affect the diagnosis. For example, in addition to the histomorphology of the tissue on the slide, pathologists may be influenced by other factors when arriving at a diagnosis. The clinical data may be incorrectly weighted or "double counted" in the pathologist's diagnosis if the pathologist has knowledge of the patient's clinical presentation.32 If a pathologist is informed that a biopsy specimen was obtained from the mouth of a heavy smoker and alcohol drinker and from an area of erythroleukoplakia in the mandibular retromolar region, even before the slide is placed on the microscope stage, the suspicion of malignancy would be raised.33–35 In such instances, the pathologist may unconsciously grade the dysplasia or carcinoma as more severe than if the clinical information were not available.32

Measurements of examiner agreement

For conditions such as a TMD, biologic assays do not exist, and/or the criteria for the condition may not be very specific. In these instances, the best course is to determine whether the investigators are consistent in their judgments. There are several approaches to evaluating the reliability of such measurements. For any approach, however, a distinction is made between the situation in which the same person examines the same subjects more than once, called an intra-examiner (or within-examiner or intra-rater) comparison, and the contrasting comparison, the inter-examiner (or between-examiners or inter-rater) comparison, which occurs when the scores assigned by different examiners to the same subjects are correlated.

Calculation of the correlation coefficient

In this approach, the scores given by one examiner are plotted against the scores given by another examiner (if an inter-examiner comparison is being made), or by the same examiner at a different time (if an intra-examiner comparison is desired), and the correlation coefficient is calculated. Streiner and Norman36 note that the Pearson correlation coefficient is an inappropriate and liberal measure of reliability that usually overestimates the true reliability; nevertheless, although theoretically weak, it is widely understood and commonly used. Another use of a Pearson correlation coefficient involves measuring a number of objects, or patients, by the same technique at different times; it is best used when the data are continuous measurements rather than categorical judgments. This approach is based on regression analysis and gives the extent to which the relationship between two variables can be described by a straight line. The correlation coefficient r ranges from –1 (perfect negative correlation) to +1 (perfect positive correlation); a value of 0 indicates no relationship. The square of the correlation coefficient (r²) represents the proportion of the variance of the dependent variable (y) that can be accounted for by the values of the independent variable (x).

An example of a correlational approach is given in Fig 15-5. In this example, individual dental students were assessed for the number of Streptococcus mutans in their saliva on successive weeks. The data are presented as a scattergram in which the number of colony-forming units per milliliter of saliva of each student on the first test is plotted against the number found on the second test. The best straight line for these points has been calculated by the least squares method (explained in detail in Box et al37; see chapter 13). The correlation coefficient calculated for these data is 0.65. The square of the correlation coefficient indicates the proportion of the variability that can be explained by the correlation, in this instance 0.65 × 0.65 = 0.42, or 42%. Thus, 42% of the variability in the data can be explained by differences in individuals' saliva, and the other 58% must be explained by other effects, such as differences in saliva with time, laboratory technique, or sampling variations.

Fig 15-5 Pearson autocorrelation of Streptococcus mutans counts (number of S mutans × 10⁵ on the second test plotted against the first test) in the saliva of third-year dental students on 2 successive weeks. Slope ≈ 1; intercept ≈ 1.5; correlation coefficient ≈ 0.65.

Indices of agreement

A problem with the use of the correlation coefficient is that two examiners could have a perfect correlation if one consistently scored the same amount higher or lower than the other. A number of indices have been devised to counteract this problem. Some researchers prefer to use an estimate of consistency in which the proportion of judgments that are the same is calculated. This approach is commonly used when there are only two categories, such as the presence or absence of a particular sign or symptom.

Proportion agreement

Proportion agreement is the number of times that the examiners are in complete agreement divided by the number of instances examined. When this ratio is multiplied by 100, it is called percent agreement. If the comparison is made between the observations of one examiner at different times, proportion agreement has been called the repeatability index, ie, the ability of an examiner to repeat the observations. It should be noted that if there is a limited number of categories, this index would be expected to be relatively high. Gingival scores, where there are only three possible choices (0, 1, and 2), would be expected to agree one-third of the time by chance alone if the three conditions were equally represented in the sample studied. Accordingly, techniques such as the kappa statistic have been developed to adjust for the contribution of chance to agreement.

Kappa value

Another often-used measure of how well two independent observers agree is the Cohen's kappa value, κ, given by38,39:

κ = (po – pe) / (1 – pe)

where po is the relative observed agreement among observers, and pe is the hypothetical probability of chance agreement. The comparative findings reported by two observers, say Dentist #1 and Dentist #2, can be presented as a 2 × 2 table (Table 15-4). The columns and rows denote the diagnoses made by Dentist #1 and Dentist #2 using a specific diagnostic tool. Cells a and d reflect the number of times Dentist #1 and Dentist #2 agree that the diagnostic test confirms the disease or not. Cells b and c are the number of times the two dentists have contrary interpretations of the results of the diagnostic tool. The relative observed agreement, po, is essentially the probability (ie, accuracy) of agreement between the two dentists:

po = (a + d) / (a + b + c + d)


Table 15-4 | 2 × 2 table to calculate the Cohen's kappa value (diagnostic tool results)

                     Dentist #1
Dentist #2       Disease    No disease
Disease             a           b
No disease          c           d

Calculating pe is a little more complicated. It represents the expected probability of agreement if Dentist #1 and Dentist #2 randomly interpreted the results of the diagnostic test as being positive or negative for the disease, given by:

pe = pdisease + pno disease

where

pdisease = [(a + c) / (a + b + c + d)] × [(a + b) / (a + b + c + d)]

pno disease = [(b + d) / (a + b + c + d)] × [(c + d) / (a + b + c + d)]

For example, suppose two dentists, Dr Sally Smart and Dr Bill Brainiac, are reviewing the same 50 bitewing radiographs for interproximal caries. If they completely agree on the interpretation of all 50 radiographs, then κ = 1. If their agreement is no better than would be expected by chance, then κ = 0 (κ is negative if they agree less often than chance would predict). More likely, Sally and Bill will agree on the interpretation of some radiographs and disagree on others. The comparative reported findings between these two dentists can be presented as a 2 × 2 table (Table 15-5), where both Dr Smart and Dr Brainiac agreed that there is evidence of caries on 20 radiographs and no caries on 15 radiographs. However, on 10 radiographs Dr Smart thought there was evidence of caries while Dr Brainiac did not, and on 5 radiographs Dr Brainiac saw radiographic signs of caries while Dr Smart did not. The relative observed agreement, po, is essentially the probability (ie, accuracy) that Sally and Bill agree on what was and was not caries:

po = (20 + 15) / (20 + 5 + 10 + 15) = 0.70

They agree 70% of the time. Calculating pe is a little more complicated. It represents the expected probability of agreement if Sally and Bill randomly interpreted the radiographs as having caries or not.

Table 15-5 | Sample 2 × 2 table comparing the diagnostic interpretation of bitewing radiographs

                      Dr Sally Smart
Dr Bill Brainiac   Caries    No caries
Caries               20          5
No caries            10         15

For example, Sally thought there was caries on 30 radiographs and no caries on 20. On the other hand, Bill thought there was caries on half the radiographs and no caries on the other half. So the expected probability that both Sally and Bill interpreted radiographic caries is given by:

pcaries = (30/50) × (25/50) = 0.6 × 0.5 = 0.3

Similarly, the expected probability that both Sally and Bill interpreted no radiographic caries is given by:

pno caries = (20/50) × (25/50) = 0.4 × 0.5 = 0.2

Hence, the overall random agreement probability between Sally and Bill is the total probability that they interpreted either caries or not, given by the sum of pcaries and pno caries:

pe = pcaries + pno caries = 0.3 + 0.2 = 0.5

Therefore, the Cohen's kappa value between Sally and Bill at interpreting interproximal caries on the 50 bitewing radiographs is:

κ = (po – pe) / (1 – pe) = (0.70 – 0.50) / (1 – 0.50) = 0.40
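The same calculation can be scripted. Below is a minimal Python sketch (ours, not the authors'); the function name and cell labels simply follow Table 15-5.

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2 x 2 agreement table.

    a = both rate positive, d = both rate negative,
    b and c = the two kinds of disagreement.
    """
    n = a + b + c + d
    po = (a + d) / n                       # observed agreement
    p_pos = ((a + b) / n) * ((a + c) / n)  # chance agreement on "positive"
    p_neg = ((c + d) / n) * ((b + d) / n)  # chance agreement on "negative"
    pe = p_pos + p_neg
    return (po - pe) / (1 - pe)

# Table 15-5: Drs Smart and Brainiac reading 50 bitewing radiographs
print(round(cohens_kappa(a=20, b=5, c=10, d=15), 2))  # 0.4
```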

Note that the Cohen’s kappa value measures the agreement between only two observers. However, when assessing the inter-operator agreement between three dentists (eg, Drs Sally Smart, Bill Brainiac, and Goodie Hands) or more, then the Fleiss kappa value, involving a more complex computation, is used.40,41 Cohen’s and Fleiss kappa statistics are a more appropriate means to evaluate reliability of a diagnostic test between two or more dentists because they adjust for the degree of agreement expected purely by chance.42 Kappa values below 0.4 indicate poor agreement; between 0.4 and 0.75, fair agreement; and values from 0.75 to the maximum of 1, excellent agreement. In studies of the reliability of many kinds of clinical variables in medicine, Koran43,44 found most of the values

211

Brunette-CH15.indd 211

10/9/19 11:37 AM

DIAGNOSIS TOOLS AND TESTING

to be below 0.35. Fleiss et al45 found kappa to be 0.80 or higher for dental caries.

The effect of method of calculation: An example from caries research

The method of calculation can emphasize either the differences or the agreement between different examinations of the same subject, as can be exemplified from caries research, where various indices have been used. For illustration, we will consider a situation in which two investigators independently evaluate the same teeth in different individuals. Three numbers result from such a study: a = the number of teeth (or sites) on which the examiners disagreed as to carious status; b = the number of teeth (or sites) consistently diagnosed as carious; and c = the number of teeth consistently diagnosed as sound. Suppose that in a study of 200 teeth, the investigators agreed that 100 teeth were sound. Another 100 teeth were classed by at least one investigator as carious; of these, the investigators agreed on 95 teeth and disagreed on 5. Therefore, in this example, a = 5, b = 95, and c = 100. Various indices can be calculated from such data.

The reproducibility ratio (r):

r = a / b    (example: 5 / 95 = 0.053)

Ideally, r should be as close to zero or as low a value as possible.

Percent reproducibility:

r% = a / (a + b) × 100    (example: 5 / (5 + 95) × 100 = 5%)

Both of these indices emphasize diagnostic variability, because they focus only on the teeth diagnosed as carious. Shaw and Murray46 have proposed the use of a modified reproducibility index r′, defined below:

r′ = a / (b + c)    (example: 5 / (95 + 100) = 0.026)

This index considers all the teeth, both sound and carious. For many populations, such as those living in affluent areas, the large number of easily diagnosed sound teeth would make this index low, and results would appear quite reproducible. This example illustrates that reproducibility is influenced by the population of objects considered.
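These indices are simple ratios of the three counts. A minimal Python sketch (our illustration, not from the book) using the 200-tooth example above:

```python
def reproducibility_indices(a, b, c):
    """a = teeth with disagreement, b = agreed carious, c = agreed sound."""
    r = a / b                   # reproducibility ratio
    r_pct = a / (a + b) * 100   # percent reproducibility
    r_prime = a / (b + c)       # Shaw-Murray modified index
    return r, r_pct, r_prime

r, r_pct, r_prime = reproducibility_indices(a=5, b=95, c=100)
print(f"r = {r:.3f}, r% = {r_pct:.1f}%, r' = {r_prime:.3f}")
# r = 0.053, r% = 5.0%, r' = 0.026
```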

The analysis of variance approach to estimating reliability: The intraclass correlation coefficient

In this approach, the variance in a population of measurements is partitioned into the effects of the various components of the process, including, for example, in the instance that different observers were scoring patients, the effects of patients, observers, and random error. The calculation of the error term involves determining how much each individual score deviates from its expected value. The reliability coefficient (R) is defined as the ratio of the variance among patients (σ²patient) to the total of the error variance (σ²error) plus the variance among patients (for an example, see Fleiss and Kingman47):

R = σ²patient / (σ²patient + σ²error)

Notice that if the error variance equals 0, the reliability coefficient is 1, and the measurement perfectly reflects the true values for the patients. The previous formula assumes that the scores can be corrected for the effects of different observers (σ²observers). If this is not the case, the appropriate formula would be:

R = σ²patient / (σ²patient + σ²observers + σ²error)

In interpreting reliability coefficients, it should be remembered that because the coefficient is the ratio of subject variability to total variability, tests conducted on heterogeneous populations, which have high variability, will tend to give high reliability values. An extensive discussion of reliability measures is given in Streiner and Norman.36
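As a minimal sketch (not from the book), once the variance components have been estimated, eg, by analysis of variance, each form of R is a one-line calculation; the component values below are invented for illustration.

```python
def reliability(var_patient, var_error, var_observers=0.0):
    """Intraclass correlation: patient variance over total variance.

    Pass var_observers only when scores cannot be corrected for
    systematic differences among observers.
    """
    return var_patient / (var_patient + var_observers + var_error)

# Hypothetical variance components from an ANOVA of probing depths
print(round(reliability(var_patient=1.2, var_error=0.3), 2))                     # 0.8
print(round(reliability(var_patient=1.2, var_error=0.3, var_observers=0.2), 2))  # 0.71
```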

Relationship of the standard error of the measurement to reliability

As mentioned in chapter 12, we can consider measured values as: observed value = true value + error. Just as the arithmetic mean of several measurements gives a more precise estimate of the population mean than a single measurement, the mean of several replicate measurements on a subject or specimen is more reliable than a single measurement. The standard error of measurement (SEMt) is equal to the SD of the error components in a set of observed values.

Table 15-6 | Reliability values for measurement in dentistry (correlation coefficients, kappa values, and percent agreement, inter- and intra-examiner)

Periodontics
• Gingival redness49: correlation 0.61; kappa 0.32 (inter), 0.22 (intra)a; percent agreement 44% (inter), 47.5% (intra)
• Plaque49: correlation 0.81
• Bleeding on probing51: percent agreement 64%
• Lack of bleeding on probing51: percent agreement 78%
• Probing depth, general: correlation 0.63 (inter), 0.72 (intra)b; kappa 0.26c; percent agreement 69% (inter), 81.2% (intra)d

Dental radiographs
• Vital/nonvital53: percent agreement 43% (inter), 72% (intra)
• Caries, calibrated rater54: percent agreement 38.3% (inter), 60.9% (intra)
• Periodontal disease, calibrated57: 0.73 (inter), 0.80 (intra)
• Interdental bone loss/gain, conventional method55: 0.80 (inter), 0.79 (intra)
• Periapical condition, normal56: percent agreement 37% (inter), 81.5% (intra)
• Widening of periodontal ligament56: percent agreement 9% (inter), 40.2% (intra)
• Periapical radiolucency56: percent agreement 27% (inter), 76.2% (intra)

Orthodontics
• Need for orthodontic treatment55: 0.81
• Occlusal stability55: 0.56
• Dentofacial attractiveness55: 0.88
• Masticatory function55: 0.72
• Dysfunction index57: 0.84 (inter), 0.92 (intra)
• Palpation index57: 0.87 (inter), 0.86 (intra)
• Craniomandibular index57: 0.95 (inter), 0.96 (intra)

a Data also from Haffajee et al.50
b Data from Clemmer and Barbano.49
c Data from Fleiss and Chilton.42
d Data from Smith et al.52

If the reliability of the measurement were perfect, then there would be no error component, and the SEMt would equal zero. If the measurement were completely unreliable, the observed values would reflect only the error, and the SEMt would equal the SD of the observed values. Thus, there is a relationship between reliability and the SEMt, which is shown by the formula:

SEMt = s √(1 – R)

where s is the SD, and R is the reliability, which can take values from 0 to 1.

A number of strategies can improve reliability. The first strategy, replication, is to increase the number of measurements on each patient or specimen. When more than one examiner or rater is used, it may be advantageous to retrain or drop a deviant rater or to stratify raters later in the analysis. Fleiss48 discusses these strategies and other aspects of reliability.

Table 15-649–57 summarizes some of the reliability values found in various measurements in dentistry. The values depend on exactly how the measurements were done. In a sophisticated study, Abbas et al58 used analysis of variance to analyze data on the effect of training and probing force on the reproducibility of interproximal probing measurements made by three different examiners. The training consisted of a video program, which demonstrated how to use the probe on a standardized spot and a standardized direction of insertion into the pocket. The study also compared a probe that delivered a constant probing force with a standard Merritt-B probe. The results showed that both training and a pressure probe were required to eliminate significant differences among examiners. Moreover, only nonbleeding pockets gave reproducible measurements.

Diagnostic Tool Properties: Sensitivity and Specificity

A diagnostic tool is defined by two properties: its sensitivity (Sn) and specificity (Sp). These properties are a measurement of how well the tool performs against the gold standard test. A diagnostic tool is assessed on how reliable it is at testing positive on individuals known to have the targeted disease and testing negative on individuals known not to have the targeted disease, as confirmed with the gold standard test. By using a 2 × 2 contingency table (Fig 15-6), a direct comparison between the diagnostic test and the reference (gold) standard test is made. By convention, the columns of the 2 × 2 table denote the health status (ie, diseased [D] or not diseased [ND]), and the rows describe the outcomes of the diagnostic device (ie, positive test result [+] or negative test result [–]). From this table the parameters sensitivity, specificity, and predictive values can be calculated.

Fig 15-6 Contingency table showing comparison between gold standard (GS) and a new test (NT). For instance, for the disease of caries, the GS is histologic examination, and an NT for diagnosis of caries may be a laser device. A positive result with the GS means the patient really has the disease; a negative result with the GS means the patient really does not. A positive result with the NT means the patient appears to have the disease; a negative result with the NT means the patient appears not to.

                                         GS positive            GS negative
New diagnostic test results              (really has disease)   (really does not)      Total
Positive (appears to have disease)       a (true positives)     b (false positives)    a + b
Negative (appears not to have disease)   c (false negatives)    d (true negatives)     c + d
Total                                    a + c                  b + d                  a + b + c + d

The Sn and Sp are referred to as nosographic probabilities that reflect how reliable the diagnostic device is at making these identifications.3 Noso comes from the Greek nosos, meaning "disease"; hence, nosographic refers to something that describes a disease. Sensitivity and specificity are explicit test responses of a person with and without the disease, respectively.

The Sn is the nosographic true positive probability that the diagnostic tool will test positive [+] on individuals known to have the targeted disease [D]. This is represented by the mathematical notation p[+/D]. Fig 15-7 shows that 94 of the 100 people known by the gold standard test to have the targeted disease [D] actually test positive [+], so Sn = 94%. On the other hand, 6 of the 100 people who have the disease [D] test negative [–], wrongly identifying them as not having the disease. This is referred to as a nosographic false negative rate (1 – Sn) of 6%.

The Sp measures the nosographic true negative probability that people known not to have the targeted disease [ND] will test negative [–] and is depicted mathematically by p[–/ND]. For example, if 84 of 100 people known not to have the targeted disease [ND] test negative [–], then Sp = 84% (Fig 15-8). However, 16 people who do not have the disease incorrectly test positive [+] for it. Hence, the nosographic false positive rate (1 – Sp) is 16%.

Table 15-7 gives a summary of published Sn and Sp values of common diagnostic tests used in dentistry.51,58–74 No attempt has been made to review the studies in this area comprehensively. Moreover, the studies differ in the type of patients, the criteria for the gold standard, and the time span over which the measurements were made. Thus, direct comparisons between methods cannot necessarily be made, and the table merely indicates the general range in which measurements fall.


Fig 15-7 Sensitivity (Sn). Of 100 people known by the gold standard to have the targeted disease (D), 94 test positive (true positives) and 6 test negative (false negatives): Sn = p[+/D] = 94%; nosographic false negative rate (1 – Sn) = p[–/D] = 6%.

Fig 15-8 Specificity (Sp). Of 100 people known to be healthy (ND), 84 test negative (true negatives) and 16 test positive (false positives): Sp = p[–/ND] = 84%; nosographic false positive rate (1 – Sp) = p[+/ND] = 16%.
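The quantities in Figs 15-7 and 15-8 follow directly from the contingency table cells. A minimal Python sketch (ours, not from the book), using the counts from the figures:

```python
def sensitivity(tp, fn):
    """Sn = p[+/D]: true positives over all who truly have the disease."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Sp = p[-/ND]: true negatives over all who truly lack the disease."""
    return tn / (tn + fp)

sn = sensitivity(tp=94, fn=6)    # Fig 15-7: 0.94
sp = specificity(tn=84, fp=16)   # Fig 15-8: 0.84
print(sn, 1 - sn)  # Sn and the nosographic false negative rate
print(sp, 1 - sp)  # Sp and the nosographic false positive rate
```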

The accuracy of a diagnostic test's nosographic properties (ie, Sn and Sp) is determined by studying the performance of the test on people known to have the disease and people known not to have the disease, against a reference standard (ie, gold standard). The gold standard test is the theoretical test that absolutely and accurately identifies people with the disease (Sn = 100%) and people without the disease (Sp = 100%). Although no such test exists, some tests, like histologic evaluation, are accepted as being very close. However, routine histologic evaluation of every tooth for a caries lesion, or of every soft tissue lesion for cancer, is not clinically practical, because such tests are too invasive and result in the destruction of the very thing they seek to observe. Moreover, some tests may impose significant morbidity on the patient (ie, caries), and/or they are very costly to perform on everyone when the prevalence of the disease is known to be very low (eg, oral cancer). Therefore, alternative devices that are minimally invasive and relatively low cost are often developed to test for the presence or absence of disease. Assessing the accuracy of these alternative devices requires a study that tests how accurate they are at identifying people known to have or not have the targeted disease.


Table 15-7 | Sensitivities, specificities, and likelihood ratios (LR) of some diagnostic tests used in dentistry

Test                                            Sn          Sp          LR+     LR–     Reference
Caries—Occlusal
  Visual                                        0.78        0.94        13.0    0.23    59
  Radiograph                                    0.56        0.95        11.2    0.46    60
  DIAGNOdent (KaVo)                             0.75        0.82        4.17    0.29    61
Caries—Interproximal
  Visual                                        0.23        0.99        23.0    0.78    59
  Radiograph                                    0.36        0.87        2.77    0.74    60
  Ultrasound                                    0.90        0.92        11.3    0.11    62
Periodontics
  Bone loss (subtraction radiography)           0.90        0.96        22.5    0.10    63
  Gingival redness                              0.27        0.67        0.82    1.09    51
  Plaque                                        0.47        0.65        1.34    0.82    51
  Bleeding on probing (2 mm, 5/6 threshold)     0.29        0.88        2.42    0.81    58
Peri-implantitis
  Bone loss (subtraction radiography)           0.57        0.67        1.73    0.64    64
  CBCT                                          0.60        0.59        1.46    0.68    64
Endodontics—Tooth vitality
  Cold test                                     0.87        0.84        5.44    0.15    65
  Heat test                                     0.78        0.67        2.36    0.33    65
  Electric pulp test                            0.72        0.93        10.3    0.30    65
  Laser Doppler flow                            0.98        0.95        19.6    0.02    65
  Pulp oximeter                                 0.97        0.95        19.4    0.03    65
Apical periodontitis
  Conventional intraoral                        0.59        0.70        1.96    0.59    66
  Digital intraoral                             0.56        0.78        2.54    0.56    66
  CBCT                                          0.95        0.88        7.92    0.06    66
Vertical root fracture
  CBCT                                          0.84        0.64        2.33    0.25    67
TMD
  Pain on opening                               0.17–0.82   0.31–0.88                   68
  Click                                         0.10–0.89   0.14–0.90                   68
  Crepitus                                      0.02–0.85   0.30–0.97                   68
  Disc displacement (MRI sagittal)              0.86        0.63        2.3     0.22    69
  Degenerative changes (tomography, sagittal)   0.47        0.94        7.8     0.56    70
  Degenerative changes (ultrasound)             0.72        0.90        7.2     0.31    71
Oral cancer
  Conventional exam/visual (prevalence less than 10%)   0.50–0.99   0.98                72
  Toluidine blue (suspicious lesions)           0.87        0.71        3.00    0.18    72, 73
  Autofluorescence (suspicious lesions)         0.90        0.72        3.21    0.14    72, 74
  Cytologic testing (suspicious lesions)        0.92        0.94        15.33   0.09    72, 73

CBCT, cone beam computed tomography; MRI, magnetic resonance imaging.

For example, to accurately assess the Sn and Sp of toluidine blue at identifying oral cancer would require a study that recruited and then tested the diagnostic device on a sample population representative of the type of patients seen in clinical practice. Everyone in the sample would be subjected to the toluidine blue test. Then all tissue that tested positive or negative would be biopsied for histologic confirmation and for calculation of the diagnostic test's respective Sn and Sp.

Like any other clinical study, diagnostic accuracy studies are at risk of bias. Inaccuracy in the reported Sn and Sp of a diagnostic tool can occur when the sample test population is not representative of the population to which the tool will be applied. For example, an autofluorescence device (VELscope Vx Enhanced Oral Assessment System) was introduced to the dental market as a device to screen for oral cancer. The promotion was based on a study by Lane et al that reported a Sn and Sp of 98% and 100%, respectively.74 But these impressive results were based on a small sample of patients (n = 44) from a cancer clinic rather than the general population and hence may be an example of spectrum bias. This bias is a systematic error that can lead to an overestimation of the test's sensitivity and specificity because the test is performed on a special sample of people (ie, oral cancer patients in remission being monitored for a possible relapse of the disease) who are not representative of the type of patient typically seen in general practice.75 Indeed, a study by Farah et al using a population sample somewhat more representative of the general population reported a much lower Sn and Sp of 30% and 63%, respectively.76

Therefore, it is important to critically appraise the validity of Sn and Sp reported in the study before blindly using them. The STARD statement (Standards for Reporting of Diagnostic Accuracy Studies) was developed to help standardize the reporting of diagnostic test studies.77 It includes a checklist of necessary information to ensure the reporting of accurate nosographic diagnostic test properties.78

Positive Predictive Value, Negative Predictive Value, and Diagnostic Error Rates

Although Sn and Sp are important numbers to measure, they do not represent the typical clinical scenario, because these nosographic probabilities represent the reliability of the test on people known to have the targeted disease or known not to have it. In clinical practice, the dentist or hygienist does not test people known to have the targeted disease but people whose health status is unknown before they are given the test (Fig 15-9).

Fig 15-9 Screening test on people whose status is not known. How many of those who test positive truly have the targeted disease? How many of those who test negative truly do not?

Fig 15-10 Predictive values (posttest probabilities) and diagnostic false rates. People in the general population, whose health status is not known, are given the test. For every 100 people who test positive, 70 are confirmed via a gold standard test to have the disease (diagnostic true positive rate: PPV = p[D/+] = 70%) and 30 healthy people are incorrectly identified as having it (diagnostic false positive rate: 1 – PPV = p[ND/+] = 30%). Furthermore, for every 100 people who test negative, 90 confirmed healthy people correctly test negative (diagnostic true negative rate: NPV = p[ND/–] = 90%), but 10 people confirmed to have the targeted disease incorrectly test negative (diagnostic false negative rate: 1 – NPV = p[D/–] = 10%).

The purpose of the diagnostic tool is to assess the possibility that a person has the targeted disease based on the results of the test. Therefore, in clinical practice, the dentist or hygienist wants to know one of the following: (1) What is the probability that someone who has tested positive really has the targeted disease? or (2) What is the probability that someone who has tested negative really does not have the targeted disease?

These are referred to as the diagnostic probabilities or diagnostic rates, which reflect the performance of the diagnostic tool on patients in clinical practice whose health status is not known, as opposed to the nosographic probabilities, which reflect the tool's performance on people known to be diseased and people known to be disease free.3 The probability that someone who has tested positive truly has the targeted disease is known as the positive predictive value (PPV) and is noted as p[D/+], with an associated diagnostic false positive rate (1 – PPV).


Likewise, the probability that an individual who tested negative truly does not have the targeted disease is referred to as the negative predictive value (NPV) and is described mathematically as p[ND/–], with an associated diagnostic false negative rate (1 – NPV). However, these diagnostic rates reflect the reliability of the diagnostic test in the population being tested,3 and the diagnostic rates of a screening test in a population with low disease prevalence (eg, oral cancer screening in the general public) are very different from the rates in a high-prevalence population (eg, dental caries in the general population). The hygienist needs to take into account the risk of the population when deciding whether a screening test is appropriate.

Figure 15-10 describes the PPV in an example where 70 of the 100 people known to test positive [+] truly have the targeted disease [D]. Because 30 of these 100 people who tested positive do not truly have the disease, 1 – PPV is 30%. Figure 15-10 also shows an example of the NPV, where 90 of 100 people known to test negative [–] are confirmed not to have the targeted disease [ND], with a 1 – NPV of 10%. It is important to appreciate that Sn and Sp (nosographic probabilities) differ from PPV and NPV (diagnostic rates), just as p[ /HYG] and p[HYG/ ] differed in the example given earlier.

Calculating Positive Predictive Value and Negative Predictive Value

Example: Toluidine blue staining as a screening test for oral cancer

Because the Sn and Sp of the diagnostic device are theoretically measured against the gold standard device, they are constants. However, the PPV and NPV are a function of the device's Sn and Sp as well as of the probability that the individual has the targeted disease before the test is given, discussed previously as the pretest probability of the targeted disease (ie, ppretest[D]). The pretest probability is often estimated to be the prevalence of the disease in the tested population.

For many, the concept of pretest probability can take some time to grasp. To appreciate the concept is to understand that, as clinicians, we make decisions in a world of uncertainty. When we meet a patient for the first time, we may have a hunch that something is wrong. As we gather more information on the patient's health status through diagnostic and screening tests, our hunch gets stronger that they either have the disease or do not. For example, the dental hygienist will typically include an oral cancer exam on the asymptomatic patient during a routine recall appointment. Essentially, the dental hygienist is screening the patient for oral cancer. The dental hygienist routinely inspects the oral cavity for signs that would increase or decrease the likelihood of oral cancer. The pre-examination likelihood is referred to as the pretest probability and, in this case, reflects the prevalence of oral cancer in the population, which in the United States is reported to be 0.08% (ie, p[D] = 0.08%).79

The dental hygienist may want to complement the visual examination with a diagnostic tool such as toluidine blue staining. The patient will test either positive or negative to toluidine blue staining. The dental hygienist will then consider the extent to which the probability of oral cancer increases for the patient who tested positive, or decreases for the patient who tested negative. These posttest probabilities are reflected in the PPV and NPV of the screening test on the whole US population. Unlike Sn and Sp, which are assumed to be constant properties of the diagnostic tool, the diagnostic rates PPV and NPV will vary depending on the pretest probability and the Sn and Sp of the chosen diagnostic tool. In the case of toluidine blue staining, the Sn and Sp are reported to be about 87% and 71%, respectively (see Table 15-7).63 Knowing that the pretest probability is the prevalence of oral cancer in the general population (ppretest[D] = 0.08%) and knowing the Sn and Sp of toluidine blue staining, the dental hygienist can now calculate the PPV and NPV of this oral cancer screening test and their respective diagnostic error rates, 1 – PPV and 1 – NPV, using Bayes' formula or with the aid of a 2 × 2 table.

Bayes' formula for calculating a diagnostic test's PPV is given by:

PPV = (Sn × ppretest) / [Sn × ppretest + (1 – Sp) × (1 – ppretest)]

In the case of toluidine blue, the PPV can be calculated as follows:

PPV(toluidine blue) = (0.87 × 0.0008) / [0.87 × 0.0008 + (1 – 0.71) × (1 – 0.0008)] = 0.0024

Hence, a positive toluidine blue test for oral cancer only slightly increases the negligible pretest probability, from 0.08% to 0.24%.


Table 15-8a | 2 × 2 table design

                          Health status
Diagnostic test results   Disease (D)    No disease (ND)    Total
Test +                    a              b                  a + b
Test –                    c              d                  c + d
Total                     a + c          b + d              a + b + c + d

Columns refer to the true health/disease status, and rows describe the outcomes of the diagnostic test. a = the number of individuals with the targeted disease who correctly test positive; b = the number of false positives; c = the number of false negatives; d = the number of individuals without the targeted disease who correctly test negative.

Table 15-8b | Formulae

Prevalence = p[D] = (a + c) / (a + b + c + d)
Sn = p[+/D] = a / (a + c)
Sp = p[–/ND] = d / (b + d)
PPV = p[D/+] = a / (a + b)
NPV = p[ND/–] = d / (c + d)

+ = positive test result; – = negative test result; D = disease; ND = no disease.

Bayes' formula for calculating a diagnostic test's NPV is given by:

NPV = [Sp × (1 – ppretest)] / [(1 – Sn) × ppretest + Sp × (1 – ppretest)]

NPV(toluidine blue) = [0.71 × (1 – 0.0008)] / [(1 – 0.87) × 0.0008 + 0.71 × (1 – 0.0008)] = 0.9999

Therefore, a negative toluidine blue test for oral cancer essentially rules out any risk for oral cancer.

A more intuitive approach to determining the predictive values, without having to memorize Bayes' formula, is with the aid of a 2 × 2 table (Table 15-8 and Box 15-2). By convention, the columns of the 2 × 2 table denote the health status (ie, diseased [D] or not diseased [ND]), and the rows describe the outcomes of the diagnostic device (ie, positive test result [+] or negative test result [–]). First, the clinician assumes, for ease of calculation, a theoretical test population of 1 million people (Box 15-2a: a + b + c + d = 1,000,000). Knowing that the pretest probability (ie, prevalence) of oral cancer is 0.08%, 800 of the 1,000,000 people in the theoretical test population will have oral cancer (Box 15-2b: a + c = 800), and 999,200 will not (Box 15-2b: b + d = 999,200). The Sn of toluidine blue staining indicates that, of the 800 people known to have oral cancer, 87%, or 696 of them, will correctly test positive (Box 15-2c: a), and 13%, or 104 (Box 15-2c: c), will incorrectly test negative. The Sp of toluidine blue staining indicates that, of the 999,200 people known not to have oral cancer, 71%, or 709,432 of them, will correctly test negative (Box 15-2d: d), and 289,768 (Box 15-2d: b) will incorrectly test positive. The table can now be completed: of the 1 million people in this population, the total number who test positive is 290,464 (Box 15-2e: a + b), and the total number who test negative is 709,536 (Box 15-2e: c + d).

Using the data derived from this 2 × 2 table, the PPV, NPV, 1 – PPV, and 1 – NPV can now be calculated as follows:

PPV = p[D/+] = a / (a + b) = 696 / 290,464 = 0.0024 (0.24%)

The diagnostic true positive rate, PPV, is 0.24%, which means that 24 out of 10,000 people who test positive will actually have the disease. Hence, the diagnostic false positive rate, 1 – PPV, is 99.76%.

NPV = p[ND/–] = d / (c + d) = 709,432 / 709,536 = 0.9999 (99.99%)
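Both routes, Bayes' formula and the 2 × 2 table, can be checked with a few lines of code. A minimal Python sketch (our illustration, not from the book) reproducing the toluidine blue numbers:

```python
def ppv(sn, sp, pretest):
    """Positive predictive value from Sn, Sp, and pretest probability."""
    return (sn * pretest) / (sn * pretest + (1 - sp) * (1 - pretest))

def npv(sn, sp, pretest):
    """Negative predictive value from Sn, Sp, and pretest probability."""
    return (sp * (1 - pretest)) / ((1 - sn) * pretest + sp * (1 - pretest))

# Toluidine blue screening in the general population (prevalence 0.08%)
sn, sp, prev = 0.87, 0.71, 0.0008
print(f"PPV = {ppv(sn, sp, prev):.4f}")  # ~0.0024 (0.24%)
print(f"NPV = {npv(sn, sp, prev):.4f}")  # ~0.9999 (99.99%)
```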


Box 15-2 | 2 × 2 calculations (toluidine blue staining vs health status)

(a) Assume a test population: a + b + c + d = 1,000,000
(b) Apply the prevalence (0.08%): a + c = 800; b + d = 999,200
(c) Apply the Sn (87%) to the 800 with cancer: a = 696; c = 104
(d) Apply the Sp (71%) to the 999,200 without cancer: d = 709,432; b = 289,768
(e) Completed table:

                      Cancer (D)     No cancer (ND)     Total
Test +                696 (a)        289,768 (b)        290,464 (a + b)
Test –                104 (c)        709,432 (d)        709,536 (c + d)
Total                 800 (a + c)    999,200 (b + d)    1,000,000 (a + b + c + d)


Table 15-9 | Relationship between the PPV and NPV of toluidine blue staining and the pretest probability for oral cancer

ppretest[D] (%)    PPV (%)    NPV (%)
0                  0          100
0.08               0.23       99.8
1                  3          99.8
5                  14         99
10                 25         98
25                 50         94
50                 75         85
75                 90         65
90                 96         38
95                 98         22
100                100        0

Since the diagnostic true negative rate, NPV, is 99.99%, the diagnostic false negative rate, 1 – NPV, of 0.01% indicates that a negative toluidine blue test result is essentially flawless at correctly identifying individuals without oral cancer.

In summary, although the probability of a patient truly having oral cancer after testing positive with toluidine blue staining triples, from 0.08% (or 8 in 10,000 people) before the test is given to 0.24% (or 24 in 10,000 people) after, there is still a 99.76% chance that a positive test will be wrong. In other words, 9,976 out of 10,000 positive tests will turn out, after histologic investigation, not to be oral cancer. Due to its very low positive predictive value and exceedingly high diagnostic false positive rate, toluidine blue is arguably not a useful tool to screen for possible oral cancer in the general population. Furthermore, the patient who tests negative to toluidine blue will see only a minimal increase in the likelihood of not having cancer, from 99.92% (100% – 0.08%) before the test is given to a posttest probability of 99.99% (NPV). This increase is meaningless, since the patient from the general population was unlikely to have oral cancer to begin with. Hence, even using toluidine blue as a screening tool specifically to rule out oral cancer in the general population is not very helpful. However, keep in mind that PPV and NPV are a function of the pretest probability. For example, if the pretest probability of oral cancer were 10%, the 2 × 2 table analysis would derive much different PPV and NPV values for the toluidine blue test, of about 25% and 98%, respectively (Table 15-9).

Fig 15-11 Differential criteria yield different estimates of sensitivity and specificity. The frequency distributions of negative and positive sites are plotted against measured change in attachment level (mm); a vertical line at the diagnostic criterion divides the distributions into the true-negative fraction (TNF), false-negative fraction (FNF), true-positive fraction (TPF), and false-positive fraction (FPF).

Receiver Operating Characteristic Analysis

In discussing sensitivity and specificity in the previous section, we assumed that results would fall neatly into the discrete categories of positive or negative. Many tests, however, yield data with more than two outcomes (see the hypertension discussion earlier in the chapter). For example, in assessing whether active periodontal disease is present, the required change in attachment level could be set at 1, 2, or 3 mm. One clinician might decide that disease was present when the attachment loss was 1 mm or greater. Another clinician might be more concerned with the variations caused by measurement error and decide with certainty that disease was present only with a loss of 2 mm or greater. These differing criteria applied to the same population of patients would yield different estimates of the sensitivity and specificity of the attachment-loss assay for periodontal disease.

This situation is illustrated in Fig 15-11, which plots the frequency distribution of positive (active disease) and negative sites against measured change in attachment level. In this example, it is assumed that the true-positive sites have a mean attachment loss of 2.5 mm and the true-negative sites have a mean attachment loss of 0 mm. The variability of scores within the distributions is assumed to be a function solely of random error (ie, the actual loss is either 0.0 or 2.5 mm). These assumptions are obviously simplistic; in a real population, there would be positive sites that vary in attachment loss.80


The SD of both distributions is 1 mm, which is close to the measurement error observed in many studies. The diagnostic criterion is represented as a vertical line drawn at an attachment level, and it determines the fractions of true-positive, false-positive, true-negative, and false-negative results. If we adopted a diagnostic criterion of 4.0 mm of loss before diagnosing disease, we could see, by drawing a vertical line in Fig 15-11 at 4 mm, that the number of false-positive results would be zero, but the number of false-negative results would be very high.

In recent years, this problem of establishing diagnostic criteria has been tackled by receiver operating characteristic (ROC) analysis, an approach originating in statistical decision theory and electronic signal detection. In dental research, ROC analysis has been applied to periodontal diagnosis80 and caries detection.81 ROC analysis enables the comparison of tests without any selection of upper or lower reference limits or any particular sensitivity and specificity. An ROC graph plots the nosographic true-positive fraction (TPF) as a function of the nosographic false-positive fraction (FPF). TPF is equivalent to sensitivity, and FPF is equivalent to 1 – specificity. Likewise, the nosographic true-negative fraction (TNF) and false-negative fraction (FNF) are equivalent to the test's specificity and 1 – sensitivity, respectively. Each diagnostic criterion gives rise to a single point on the ROC curve. When probability fractions (ie, nosographic probabilities) are used in ROC analysis, the curves are independent of disease prevalence, thus reflecting the performance of the diagnostic system per se.

The data from Fig 15-11, in which the mean attachment loss is 2.5 mm, are replotted as an ROC curve in Fig 15-12. The point representing the 2.0-mm attachment-level diagnostic criterion is circled. Figure 15-12 also gives the curves that would be obtained if the mean loss were 2.0 mm and 1.0 mm. The area (Az) under the diagonal (when the average loss is 0.0 mm) equals 0.5 (ie, Az = 0.5) and represents the value for chance alone. Discriminatory ability above the level of chance for diagnostic accuracy is indicated by Az values between 0.5 and 1.0. Obviously, as the size of the attachment loss becomes bigger, the decision between positive and negative sites becomes easier to make accurately; this is shown in the ROC analysis by a greater area under the curve.

Fig 15-12 ROC curves from the data in Fig 15-11: the true-positive fraction (Sn) is plotted against the false-positive fraction (1 – Sp) for mean attachment losses of 1.0, 2.0, and 2.5 mm.
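Under the simplified model described above (negative sites centered at 0 mm and positive sites at some mean loss, both with SD 1 mm), each candidate criterion yields one (FPF, TPF) point, and Az has the closed form Φ(Δ/√2). A minimal Python sketch (ours, not from the book):

```python
from statistics import NormalDist

def roc_point(criterion, mean_pos, mean_neg=0.0, sd=1.0):
    """TPF (Sn) and FPF (1 - Sp) when disease is called at or above criterion."""
    neg, pos = NormalDist(mean_neg, sd), NormalDist(mean_pos, sd)
    tpf = 1 - pos.cdf(criterion)
    fpf = 1 - neg.cdf(criterion)
    return fpf, tpf

def area_under_curve(mean_pos, mean_neg=0.0, sd=1.0):
    """Az for two equal-variance normal distributions."""
    delta = (mean_pos - mean_neg) / (sd * 2 ** 0.5)
    return NormalDist().cdf(delta)

for criterion in (1.0, 2.0, 4.0):  # candidate attachment-loss criteria (mm)
    fpf, tpf = roc_point(criterion, mean_pos=2.5)
    print(f"criterion {criterion} mm: FPF = {fpf:.3f}, TPF = {tpf:.3f}")
print("Az =", round(area_under_curve(2.5), 2))  # ~0.96
```

Note how the 4.0-mm criterion drives FPF essentially to zero while TPF collapses, matching the trade-off described in the text.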

SpIN and SnOUT

Except for the theoretical gold standard diagnostic tool, the design of a diagnostic tool is a trade-off between its Sn and Sp. A test with high sensitivity will capture almost everyone with the disease at the expense of capturing many who do not have it (nosographic false positives) (Fig 15-13a). Therefore, when a diagnostic tool with a high Sn tests negative on a patient, it likely means the patient does not have the targeted disease. The mnemonic for the chairside clinician to remember is SnOUT: if the test has a high Sn, then a negative test suggests ruling out the targeted disease. Similarly, a very specific test will rarely produce a nosographic false positive, at the expense of failing to identify many who do have the disease (Fig 15-13b). The mnemonic SpIN reminds the clinician to generally rule in patients who test positive with a diagnostic tool that has a high Sp.

Likelihood ratios and nomograms

The likelihood ratio is another metric that offers the clinician a quick, intuitive chairside assessment of a diagnostic tool's utility at ruling in or ruling out the targeted disease.


Fig 15-13 Venn diagrams illustrating the diagnostic false positive and false negative rates of highly sensitive and highly specific tests. (a) A very sensitive test identifies almost everyone with the targeted disease but at the expense of significant false positives. (b) A very specific test yields few false positives but at the expense of significant false negatives.

Fig 15-14 SpIN and SnOUT rule of thumb. (a) SpIN: If a patient tests positive with a very specific test, then rule the patient in as possibly having the targeted disease. (b) SnOUT: If a patient tests negative with a very sensitive test, then rule the patient out as likely not having the targeted disease.

The likelihood ratios are the odds form of a diagnostic tool's nosographic performance. The positive likelihood ratio (LR+) is the odds of the diagnostic tool correctly testing positive (Sn) over falsely testing positive (1 – Sp). The negative likelihood ratio (LR–) is the odds of the diagnostic tool incorrectly testing negative over correctly testing negative:

LR+ = nosographic true positive rate / nosographic false positive rate = Sn / (1 – Sp)

LR– = nosographic false negative rate / nosographic true negative rate = (1 – Sn) / Sp

As a rule of thumb, an LR+ greater than 5 may suggest ruling in the targeted disease with a positive test. A high LR+ usually implies a low nosographic false positive rate and thus a high Sp (ie, SpIN). On the other hand, an LR– less than 0.2 implies a test that is useful at ruling out the targeted disease, because it likely has a high Sn (ie, SnOUT)82 (Fig 15-14).

For example, consider a patient who presents with TMJ pain and dysfunction. Physical examination reveals that the patient has severe pain on palpation of the TMJ area with moderate sounds (clicking and/or crepitus) on opening. Sagittal magnetic resonance imaging (MRI) has an LR+ of 2.3 and an LR– of 0.22 (see Table 15-7). Hence, a negative sagittal MRI may be helpful at ruling out a displacement of the disc. However, positive sagittal tomography (LR+ = 7.8, LR– = 0.56) may guide the clinician toward ruling in possible degenerative TMJ pathology and performing more tests to test this diagnostic hypothesis.

Likelihood ratios are used with nomograms as a fast and convenient chairside alternative to the 2 × 2 table for calculating posttest probabilities (Fig 15-15). Figure 15-16 demonstrates use of the nomogram in the diagnostic decisions for three patients with potential interproximal caries (the disease) using the DIAGNOdent (KaVo) laser device (the test); Fig 15-17 summarizes these results.


sensitivity of 0.75 and a specificity of 0.82 (see Table 15-7). LR+ is calculated as 4.17; LR– is 0.29. In each case, the clinician detects a small area of discoloration on the distal aspect of the mandibular first premolar, but the clinician is not able to engage the explorer interproximally. For the disease of caries, the clinician has assigned a test threshold of 30% and a test-treatment threshold of 65% (see Fig 15-1). Patient A (see Fig 15-16a) is an adolescent female with an unrestored permanent dentition who practices excellent oral hygiene and is very compliant with twiceyearly prophylaxis appointments. Bitewing radiographs taken 1 year ago at the completion of orthodontic treatment do not reveal any abnormalities. The clinician assigns a pretest probability for caries of 1%. The clinician’s pretest probability is located well below the test threshold of 30%; therefore, application of the DIAGNOdent is not indicated. In the event that the test (DIAGNOdent) yielded a positive test result, the probability of caries (PTL[+]) can be calculated by aligning the straightedge at 1% in the pretest probability column with an LR+ of 4.17 in the LR column. The posttest probability of caries present is raised to < 5%. Despite this positive test result, no further tests or restoration would be indicated, because this probability—although higher than the pretest probability of 1%—is still much lower than the test threshold of 30%. If the test results were negative, then the posttest probability of Patient A having caries (PTL[–]) can be calculated, using the LR– of 0.29 in the LR column, to be < 0.2%, effectively ruling out the presence of caries. Patient B (see Fig 15-16b) is a young adult with a moderately restored posterior dentition who has been traveling the world for several years and demonstrates poor oral hygiene and poor compliance with recommended dental recall and prophylaxis appointments. The patient was last seen 3 years ago, when bitewing radiographs revealed three sites of interproximal caries. The clinician assigns a pretest probability of 50% to the presence of caries. This pretest probability is located between the test and test-treatment thresholds; therefore, additional information (testing) is indicated. With a positive test result (PTL[+] > 80%), treatment is indicated, but a negative test result (PTL[–] < 22%) rules out the disease and treatment. Patient C (see Fig 15-16c) is an elderly patient with a heavily restored dentition and recent history of recurrent and new caries. The patient is being treated for depression, and the medication has caused xerostomia; although she is a compliant patient, Patient C demonstrates poor oral hygiene. The clinician assigns a pretest probability for caries at 95%, and treatment is

Fig 15-15 Nomogram converting pretest and posttest odds to their corresponding probabilities. To use the nomogram, align a straightedge with the pretest probability (left-hand column) and the LR (center column) of the test being used; the posttest probability is read where the straightedge crosses the right-hand column.


Fig 15-16 The DIAGNOdent is reported to have an LR+ of 4.17 and an LR– of 0.29. (a) Patient A is assigned a 1% pretest probability of having caries. A positive DIAGNOdent test indicates a posttest probability of Patient A having caries of less than 5% (dark green line). With a negative DIAGNOdent test, the posttest probability of Patient A having caries drops to less than 0.2% (light green line).

Fig 15-16 (cont) (b) Patient B is assigned a 50% pretest probability of having caries. A positive DIAGNOdent test indicates a posttest probability of Patient B having caries of over 80% (dark green line). With a negative DIAGNOdent test, the posttest probability of Patient B having caries drops to less than 22% (light green line).


Fig 15-17 Use of LRs and the nomogram in diagnostic decisions for three patients with possible caries who have undergone the DIAGNOdent laser test. Posttest probabilities and decisions: Patient A (pretest 1.0%): positive result, 4%, no treatment (observe only); negative result, 0.2%, no treatment. Patient B (pretest 50%): positive result, 81%, treatment; negative result, 22%, no treatment (observe only). Patient C (pretest 95%): positive result, 98%, treatment; negative result, 84%, treatment.

indicated without further diagnostic testing. That is, in this case, the DIAGNOdent is not required to establish the diagnosis of caries, although radiographs may provide useful information to guide treatment of the caries or the diagnosis and treatment of other pathologies. For Patient C, even a negative test result would still leave a posttest probability of caries greater than 80%, requiring treatment. This case illustrates that when the probability of disease is high, clinicians must be careful not to overinterpret negative test results.
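The arithmetic behind these nomogram readings is simple: convert the pretest probability to odds, multiply by the appropriate LR, and convert back to a probability. The following minimal sketch (in Python; the function names are illustrative, not taken from any dental software) reproduces the calculations for Patients A, B, and C:

```python
def probability_to_odds(p):
    """Convert a probability (0 < p < 1) to odds."""
    return p / (1 - p)

def odds_to_probability(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)

def posttest_probability(pretest_p, lr):
    """Posttest odds = pretest odds x LR; convert back to a probability."""
    return odds_to_probability(probability_to_odds(pretest_p) * lr)

# DIAGNOdent: Sn = 0.75, Sp = 0.82 (see Table 15-7)
sn, sp = 0.75, 0.82
lr_pos = sn / (1 - sp)   # approximately 4.17
lr_neg = (1 - sn) / sp   # approximately 0.30 (reported as 0.29)

for name, pretest in [("Patient A", 0.01), ("Patient B", 0.50), ("Patient C", 0.95)]:
    print(name,
          f"positive: {posttest_probability(pretest, lr_pos):.1%}",
          f"negative: {posttest_probability(pretest, lr_neg):.1%}")
```

Run as written, the printed values agree with the nomogram readings summarized in Fig 15-17 to within reading and rounding error.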

Fig 15-16 (cont) (c) Patient C is assigned a 95% pretest probability of having caries. A positive DIAGNOdent test indicates a posttest probability of Patient C having caries of over 98% (dark green line). With a negative DIAGNOdent test, the posttest probability of Patient C having caries drops to less than 84% (light green line).

Returning to Clinical Decision Making

In clinical practice, the outcome of a diagnostic test helps guide the clinician either to do nothing (if confident that a negative test is highly reliable), to do something (if the diagnostic threshold has been reached), or to seek further information through another diagnostic test (if still uncertain). The threshold set by the dental professional will depend on the consequence of doing nothing when something should have been done (a diagnostic false negative) or of doing something when nothing should have been done (a diagnostic false positive). In the case of oral cancer screening, a false negative may be life threatening.


Table 15-10 | Relationship between the PPV and NPV of bitewing radiographs and the pretest probability for interproximal dentin caries (Sn = 36% and Sp = 87%)

Ppretest[D] (%)    PPV (%)    NPV (%)
0                    0         100
1                    5          99
5                   23          96
10                  38          93
25                  65          81
50                  85          59
75                  94          32
90                  98          14
95                  99           7
100                100           0

However, the consequence to the patient of acting on a false positive is the cost and pain of a biopsy and the weeks of fear and stress spent waiting to learn that no oral cancer was present to begin with. For example, it may be hard to justify the routine use of toluidine blue staining as a screening device in general practice: despite its impressive Sn and Sp, the prevalence of oral cancer is so low that a positive result does not change the pretest probability enough to motivate the clinician to refer the patient for a biopsy, and, for the same reason, using toluidine blue staining only to rule out oral cancer is also hard to justify. Bitewing radiographs, in contrast, are routinely used to screen for interproximal caries lesions. The prevalence of untreated dental caries is about 26%, much higher than that of oral cancer.22 A recent systematic review reported the Sn and Sp of bitewing radiographs in identifying interproximal dentin lesions to be 36% and 87%, respectively.60 This means that, from a nosographic point of view, bitewing radiographs are good at identifying people known not to have interproximal caries but weak to mediocre at identifying interproximal caries in people known to have a caries lesion. However, the clinician is interested in how well bitewing radiographs perform in clinical practice. Based on the prevalence of dental caries in the general population (Ppretest[D] = 26%) and the Sn and Sp given previously, the PPV and NPV of bitewing radiographs for interproximal caries lesion assessment are 67% and 79%, respectively. Remember that the outcome of a diagnostic test does not dictate the decision but rather guides it in a specific direction.

The clinical decision to excavate and restore the lesion will also depend on clinical factors unique to the individual patient: fluoride exposure, diet, socioeconomic status, regularity of recall visits, dental history, and individual caries risk assessment, to name a few. And, of course, it is based on weighing the consequences of not treating an active lesion against the consequences of overtreatment for the long-term strength of the tooth. For example, if the patient is identified as being at low risk, with a 1% chance of having a caries lesion before a bitewing radiograph, the PPV and NPV would be 5% and 99%, respectively. Table 15-10 shows the PPV and NPV for screening for interproximal dentin lesions with bitewing radiographs across a range of pretest probabilities.
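The dependence of PPV and NPV on the pretest probability follows from the standard definitions based on Bayes' theorem. Below is a minimal sketch in Python (the function name is illustrative); because published Sn and Sp values are rounded, such a calculation approximates rather than exactly reproduces published predictive values:

```python
def predictive_values(pretest_p, sn, sp):
    """Return (PPV, NPV) for a test with sensitivity sn and specificity sp
    applied to a population with pretest probability pretest_p of disease."""
    tp = sn * pretest_p                # true positives
    fp = (1 - sp) * (1 - pretest_p)   # false positives
    fn = (1 - sn) * pretest_p         # false negatives
    tn = sp * (1 - pretest_p)         # true negatives
    return tp / (tp + fp), tn / (tn + fn)

# Bitewing radiographs for interproximal dentin lesions (Sn and Sp from the text)
for p in [0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90]:
    ppv, npv = predictive_values(p, sn=0.36, sp=0.87)
    print(f"pretest {p:.0%}: PPV {ppv:.0%}, NPV {npv:.0%}")
```

The sketch makes the chapter's central point concrete: as the pretest probability falls, the PPV collapses even though Sn and Sp are unchanged.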

Conclusion

In this chapter, we considered the responses of three dentists to the diagnostic problem presented by Ralph, a fictional patient with an undiagnosed problem, as happens in the real world. As reported in the Reader's Digest article,83 different dentists came to different conclusions after examining the same patient, and these conclusions were based on gut feelings about the probability of possible diseases. If pressed by the patient (or the patient's lawyer) to explain their reasoning, they might have to fall back on the phrase "It's my professional opinion." We have attempted to transform those gut feelings into probabilities of disease based on the characteristics of the diagnostic tests and the laws of probability. Because of the variability of signs and symptoms among patients with the same condition, these estimates of probabilities cannot be guaranteed to be absolutely accurate, but they are defensible; that is, a dentist using them can explain the reasoning used to arrive at the probabilities. Moreover, by using this approach, dentists can improve their diagnostic performance by selecting tests, including newly developed ones, that have the potential to improve the accuracy of their diagnoses. Chapter 23 deals with the use of probabilistic information in treatment decisions.

Acknowledgments

Thank you to Don, with whom I had numerous editorial sessions. He helped me express my ideas in a clear and concise manner, which I hope will engage the reader in further reflection.


References

1. Perera R, Heneghan C. Making sense of diagnostic tests likelihood ratios. Evid Based Med 2006;11:130–131.
2. Sox HC, Higgins MC, Owens DK. Introduction. In: Medical Decision Making, ed 2. Chichester: Wiley-Blackwell, 2013:17.
3. Gøtzsche PC. The disease classification. In: Rational Diagnosis and Treatment: Evidence-Based Clinical Decision-Making, ed 4. Chichester: John Wiley & Sons, 2007:37–59.
4. Symptom—Oxford Reference. http://www.oxfordreference.com/view/10.1093/oi/authority.20110803100547191. Accessed 8 October 2018.
5. Sign—Oxford Reference. http://www.oxfordreference.com/view/10.1093/oi/authority.20110810105839961. Accessed 8 October 2018.
6. English Oxford Living Dictionaries. Definition of Diagnosis. https://en.oxforddictionaries.com/definition/diagnosis. Accessed 8 October 2018.
7. English Oxford Living Dictionaries. Definition of Treatment. https://en.oxforddictionaries.com/definition/treatment. Accessed 8 October 2018.
8. Beck JD. Issues in assessment of diagnostic tests and risk for periodontal diseases. Periodontol 2000 1995;7:100–108.
9. Sackett DL, Haynes RB, Tugwell P, Guyatt GH. Clinical Epidemiology: A Basic Science for Clinical Medicine, ed 2. Boston: Little Brown, 1991.
10. Wulff HR. Rational Diagnosis and Treatment. Oxford: Blackwell Science, 1976.
11. Dabelsteen E, Mackenzie IC. The scientific basis for oral diagnosis. In: Mackenzie IC, Squier CA, Dabelsteen E (eds). Oral Mucosal Diseases: Biology, Etiology and Therapy [Proceedings of the Second Dows Symposium, 99–102, 1985, Iowa City]. Copenhagen: Laegeforeningers, 1987.
12. Straus SE, Glasziou P, Richardson WS, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. Edinburgh: Elsevier, 2019.
13. Schechter MT, Sheps SB. Diagnostic testing revisited: Pathways through uncertainty. Can Med Assoc J 1985;132:755–760.
14. Oakley C, Brunette DM. The use of diagnostic data in clinical dental practice. Dent Clin North Am 2002;46:87–115.
15. Choi BC, Jokovic A. Diagnostic tests. J Can Dent Assoc 1996;62:6–7.
16. Mathews DC, Banting DW. Authors' response. J Can Dent Assoc 1996;62:7.
17. Matthews DC, Banting DW, Bohay RN. The use of diagnostic tests to aid clinical diagnosis. J Can Dent Assoc 1995;61:785–791.
18. Sox HC, Higgins MC, Owens DK. Differential diagnosis. In: Medical Decision Making, ed 2. Chichester: Wiley-Blackwell, 2013:35.
19. ADA Center for Evidence-Based Dentistry. Professionally-applied and Prescription-strength, Home-use Topical Fluoride Agents for Caries Prevention Clinical Practice Guideline (2013). https://ebd.ada.org/en/evidence/guidelines/topical-fluoride. Accessed 8 October 2018.
20. Hülsmann M, Schäfer E (eds). Problems in Endodontics: Etiology, Diagnosis and Treatment. London: Quintessence, 2009.
21. Kumar M, Nanavati R, Modi TG, Dobariya C. Oral cancer: Etiology and risk factors: A review. J Cancer Res Ther 2016;12:458–463.
22. National Institute of Dental and Craniofacial Research. Dental Caries (Tooth Decay) in Adults (Age 20 to 64). https://www.nidcr.nih.gov/research/data-statistics/dental-caries/adults. Accessed 8 October 2018.
23. National Institute of Dental and Craniofacial Research. Periodontal Disease in Adults (Age 20 to 64). https://www.nidcr.nih.gov/research/data-statistics/periodontal-disease/adults. Accessed 8 October 2018.
24. American Academy of Periodontology—Research, Science, and Therapy Committee. Periodontal diseases of children and adolescents. Pediatr Dent 2008–2009;30(7 suppl):240–247.
25. United States Census Bureau. Annual Estimates of the Resident Population for Selected Age Groups by Sex for the United States, States, Counties, and Puerto Rico Commonwealth and Municipios: April 1, 2010 to July 1, 2013. 2013 Population Estimates. https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk. Accessed 8 October 2018.
26. United States Department of Labor, Bureau of Labor Statistics. Occupational Employment and Wages, May 2017: 29-2021 Dental Hygienists. http://www.bls.gov/oes/2017/may/oes292021.htm. Accessed 8 October 2018.
27. American Dental Education Association. Dental Hygiene by the Numbers. https://www.adea.org/GoDental/Future_Dental_Hygienists/Dental_hygiene_by_the_numbers.aspx. Accessed 8 October 2018.
28. U.S. Department of Health and Human Services, National Institutes of Health. The Seventh Report of the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure. https://www.nhlbi.nih.gov/files/docs/guidelines/express.pdf. Accessed 8 October 2018.
29. Widmer GC. Physical characteristics associated with temporomandibular disorder. In: Sessle BJ, Bryant PS, Dionne RA (eds). Temporomandibular Disorders and Related Pain Conditions. Seattle: IASP, 1995:161–174.
30. Tversky A, Kahneman D. Judgment under uncertainty: Heuristics and biases. Science 1974;185:1124–1131.
31. Moore DS. Statistics: Concepts and Controversies. San Francisco: Freeman, 1979.
32. Schwartz WB, Wolf HJ, Pauker SG. Pathology and probabilities: A new approach to interpreting and reporting biopsies. N Engl J Med 1981;305:917–923.
33. Ephros H, Samit A. Leukoplakia and malignant transformation. Oral Surg Oral Med Oral Pathol Oral Radiol Endod 1997;83:187.
34. Kramer IRH. Basic histopathological features of oral premalignant lesions. In: Mackenzie IC, Dabelsteen E, Squier CA (eds). Oral Premalignancy [Proceedings of the First Dows Symposium, 23–34]. Iowa City: University of Iowa, 1980.
35. Kramer IRH. Prognosis from features observable by conventional histopathological examination. In: Mackenzie IC, Dabelsteen E, Squier CA (eds). Oral Premalignancy [Proceedings of the First Dows Symposium, 304–311]. Iowa City: University of Iowa, 1980.
36. Streiner DL, Norman GR. Health Measurement Scales. Oxford: Oxford University, 1989.
37. Box GEP, Hunter WG, Hunter JS. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. New York: John Wiley, 1978.
38. Altman DG, Machin D, Bryant TN, Gardner MJ. Statistics with Confidence: Confidence Intervals and Statistical Guidelines, ed 2. London: BMJ, 2000.
39. Altman DG. Practical Statistics for Medical Research. Boca Raton: CRC, 1999.
40. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull 1971;76:378–382.
41. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–174.
42. Fleiss JL, Chilton NW. The measurement of interexaminer agreement on periodontal disease. J Periodontal Res 1983;18:601–606.
43. Koran LM. The reliability of clinical methods, data and judgment (first of two parts). N Engl J Med 1975;293:642–646.
44. Koran LM. The reliability of clinical methods, data and judgment (second of two parts). N Engl J Med 1975;293:695–701.
45. Fleiss JL, Fischman SL, Chilton NW, Park MH. Reliability of discrete measurements in caries trials. Caries Res 1979;13:23–31.
46. Shaw L, Murray JJ. Inter-examiner and intra-examiner reproducibility in clinical and radiographic diagnosis. Int Dent J 1975;25:280–288.
47. Fleiss JL, Kingman A. Statistical management of data in clinical research. Crit Rev Oral Biol Med 1990;1:55–66.
48. Fleiss JL. The Design and Analysis of Clinical Experiments. New York: John Wiley & Sons, 1986.
49. Clemmer BA, Barbano JP. Reproducibility of periodontal scores in clinical trials. J Periodontal Res Suppl 1974;14:118–128.
50. Haffajee AD, Socransky SS, Goodson JM. Clinical parameters as predictors of destructive periodontal disease activity. J Clin Periodontol 1983;10:257–265.
51. Janssen PT, Faber JA, van Palenstein Helderman WH. Reproducibility of bleeding tendency measurements and reproducibility of mouth bleeding scores for individual patients. J Periodontal Res 1986;21:653–659.
52. Smith LW, Suomi JD, Green JC, Barbano JP. A study of intra-examiner variation in scoring oral hygiene status, gingival inflammation and epithelial attachment level. J Periodontol 1970;41:671–674.
53. Abdel Wahab MH, Greenfield TA, Swallow JN. Interpretation of intraoral periapical radiographs. J Dent 1984;12:302–313.
54. Valachovic RW, Douglass CS, Berkey CS, McNeil BJ, Chauncey HH. Examiner reliability in dental radiography. J Dent Res 1986;65:432–436.
55. Gröndahl K, Gröndahl HG, Wennström J, Heijl L. Examiner agreement in estimating changes in periodontal bone from conventional and subtraction radiographs. J Clin Periodontol 1987;14:74–79.
56. Reit C, Hollender L. Radiographic evaluation of endodontic therapy and the influence of observer variation. Scand J Dent Res 1983;91:205–212.
57. Fricton JR, Schiffman EL. Reliability of a craniomandibular index. J Dent Res 1986;65:1359–1364.
58. Abbas F, Hart AA, Oosting J, van der Velden U. Effect of training and probing force on the reproducibility of pocket depth measurements. J Periodontal Res 1982;17:226–234.
59. Gimenez T, Piovesan C, Braga MM, et al. Visual inspection for caries detection: A systematic review and meta-analysis. J Dent Res 2015;94:895–904.
60. Schwendicke F, Tzschoppe M, Paris S. Radiographic caries detection: A systematic review and meta-analysis. J Dent 2015;43:924–933.
61. Gimenez T, Braga MM, Raggio DP, Deery C, Ricketts DN, Mendes FM. Fluorescence-based methods for detecting caries lesions: Systematic review, meta-analysis and sources of heterogeneity. PLoS One 2013;8:e60421.
62. Abogazalah N, Ando M. Alternative methods to visual and radiographic examinations for approximal caries detection. J Oral Sci 2017;59:315–322.
63. Jeffcoat MK, Reddy M. Progression of probing attachment loss in adult periodontitis. J Periodontol 1991;62:185–189.
64. Bohner LOL, Mukai E, Oderich E, et al. Comparative analysis of imaging techniques for diagnostic accuracy of peri-implant bone defects: A meta-analysis. Oral Surg Oral Med Oral Pathol Oral Radiol 2017;124:432–440.
65. Mainkar A, Kim SG. Diagnostic accuracy of 5 dental pulp tests: A systematic review and meta-analysis. J Endod 2018;44:694–702.
66. Leonardi Dutra K, Haas L, Porporatti AL, et al. Diagnostic accuracy of cone-beam computed tomography and conventional radiography on apical periodontitis: A systematic review and meta-analysis. J Endod 2016;42:356–364.
67. Chang E, Lam E, Shah P, Azarpazhooh A. Cone-beam computed tomography for detecting vertical root fractures in endodontically treated teeth: A systematic review. J Endod 2016;42:177–185.
68. Reneker J, Paz J, Petrosino C, Cook C. Diagnostic accuracy of clinical tests and signs of temporomandibular joint disorders: A systematic review of the literature. J Orthop Sports Phys Ther 2011;41:408–416.
69. Westesson PL, Katzberg RW, Tallents RH, Sanchez-Woodworth RE, Svensson SA, Espeland MA. Temporomandibular joint: Comparison of MR images with cryosectional anatomy. Radiology 1987;164:59–64.
70. Tanimoto K, Petersson A, Rohlin M, Hansson LG, Johansen CC. Comparison of computed with conventional tomography in the evaluation of temporomandibular joint disease: A study of autopsy specimens. Dentomaxillofac Radiol 1990;19:21–27.
71. Dong XY, He S, Zhu L, et al. The diagnostic value of high-resolution ultrasonography for the detection of anterior disc displacement of the temporomandibular joint: A meta-analysis employing the HSROC statistical model. Int J Oral Maxillofac Surg 2015;44:852–858.
72. Walsh T, Liu JL, Brocklehurst P, et al. Clinical assessment to screen for the detection of oral cavity cancer and potentially malignant disorders in apparently healthy adults. Cochrane Database Syst Rev 2013;21:CD010173.
73. Lingen MW, Abt E, Agrawal N, et al. Evidence-based clinical practice guideline for the evaluation of potentially malignant disorders in the oral cavity: A report of the American Dental Association. J Am Dent Assoc 2017;148:712–727.
74. Lane PM, Gilhuly T, Whitehead P, et al. Simple device for the direct visualization of oral-cavity tissue fluorescence. J Biomed Opt 2006;11:024006.
75. Balevi B. Evidence-based decision making: Should the general dentist adopt the use of the VELscope for routine screening for oral cancer? J Can Dent Assoc 2007;73:603–606.
76. Farah CS, McIntosh L, Georgiou A, McCullough MJ. Efficacy of tissue autofluorescence imaging (VELScope) in the visualization of oral mucosal lesions. Head Neck 2012;34:856–862.
77. Cohen JF, Korevaar DA, Altman DG, et al. STARD 2015 guidelines for reporting diagnostic accuracy studies: Explanation and elaboration. BMJ Open 2016;6:e012799.
78. Equator Network. STARD 2015: An Updated List of Essential Items for Reporting Diagnostic Accuracy Studies. http://www.equator-network.org/reporting-guidelines/stard. Accessed 23 June 2018.
79. National Institute of Dental and Craniofacial Research. Oral Cancer Prevalence (Total Number of Cases) by Age. http://www.nidcr.nih.gov/research/data-statistics/oral-cancer/prevalence. Accessed 23 July 2019.
80. Ralls SA, Cohen ME. Problems in identifying "bursts" of periodontal attachment loss. J Periodontol 1986;57:746–752.
81. Verdonschot EH, Bronkhorst EM, Burgersdijk, König KG, Schaeken MJ, Truin GJ. Performance of some diagnostic systems in examination for small occlusal carious lesions. Caries Res 1992;26:59–64.
82. Centre for Evidence-Based Medicine. SpPin and SnNout. https://www.cebm.net/2014/03/sppin-and-snnout. Accessed 23 June 2018.
83. MacDonald J. How honest are dentists? Reader's Digest Canada 1998;(Sept).


16 Research Strategies and Threats to Validity



Constraints, Purposes, and Objectives

"Scientific principles and laws do not lie on the surface of nature. They are hidden, and must be wrested from nature by an active and elaborate technique of inquiry." JOHN DEWEY1

In many ways, the strategies involved in scientific research resemble those used in everyday problem solving. The amount of effort expended relates to the scope of the problem and the importance of the solution. Studies are performed against a background of constraints, such as funding, the investigator's available time for research, the research niche (including the accessibility of materials or subjects and the presence of competitors), and the need to comply with bureaucracies (eg, ethics review boards or animal care committees). In this chapter, I outline considerations common to most research strategies and introduce some basic concepts of clinical research. (More detailed information is given in chapters 17 to 20 and includes examples of the kinds of statistical tools used in analysis and interpretation.)

Funding

Research requires funding. The successful research strategy can be pragmatically defined as that which results in an operating grant. However, faced with a paucity of funds and an abundance of applicants, granting agencies adopt rigorous criteria for research funding, often using peer-review mechanisms. Meeting these criteria is itself a science; the science of "grantsmanship" has been the subject of numerous articles and books.2–5 Scientists spend a depressing amount of time preparing grant applications and reviewing others' applications. Recent grant success rates at the Canadian Institutes of Health Research hover around 15%; that is, roughly only one of seven grant applications is funded, and each one probably takes about 2 months to prepare. As a result, investigators must come up with imaginative approaches to finding funds from diverse sources and must choose topics amenable to small-scale studies.


Dentistry is rich in this tradition, because some kinds of study, such as comparisons of materials, can attract support from industry and can ethically be incorporated into ongoing treatment protocols in the clinic.

Although the specifics of evaluating proposals differ from topic to topic, much of the advice boils down to DeBakey's6 four criteria for a publishable paper: The proposed research must be new, true, important, and comprehensible. Meador5 recalls Edison's statement that genius is 1% inspiration and 99% perspiration, commenting that the 1% is indispensable. The important aims of the proposal should be emphasized.2–4 Mackenzie7 notes that neophyte grant writers use excessive wordiness, compromising comprehensibility; reviewers will likely believe that poorly expressed ideas are equally poorly understood by the writer. DeBakey and DeBakey8 stress the need to avoid jargon and to practice good exposition comprising clarity, conciseness, continuity, consistency, simplicity, logical transition, and readability.

Proposals can be deficient in many ways. A review of three studies of shortcomings of rejected or poorly rated applications identified the following three types of deficiency as being most prevalent9:

1. Shortcomings in the problem (18% to 58%): Problems often had insufficient importance or were nebulous, unduly complex, or premature (ie, required a pilot study first).
2. Shortcomings in the approach (73% to 99%): Issues included unsound methods, neglect of statistical issues, lack of imagination, or use of methods unsuited to the objectives of the study.
3. Weaknesses in the qualifications of the investigator (38% to 60%).

Mackenzie7 has devised a detailed checklist for proposal writing for dental faculty. But, before any actual writing is done, an investigator must consider some big-picture issues.

Time available for research

A problem plaguing dental research is the busyness of dental academics. A dental clinician on faculty at a dental school must teach not only in the lecture hall but also in the clinic, serve on university committees, and participate in the activities of organized dentistry by providing expert advice to local regulatory bodies or dental associations. Because meetings occur at definite times (whereas research is a moveable feast), research can sometimes disappear from the dental academic's agenda.

I recently had the opportunity to review the research activities of a dental school. The modal proportional time commitment to research was only 10%; it was not surprising that little research was being produced.

Niche

To make progress in research, investigators must find a niche where they have a selective advantage. Life among academics circling a research topic can be likened to the competition among species near a coral reef. Figure 16-1 shows the niche that I occupied starting around 1983. For a while, I (represented as an angelfish labeled UBC) was the only investigator using microfabricated surfaces to study cell behavior (see Fig 16-1a); I had the whole coral reef of ignorance of the effects of microfabricated surface topographies on cell behavior to explore. By 1986, investigators from the United Kingdom had moved into my reef, and I had to adjust to their activities by avoiding experiments that I could do less efficiently than they could (see Fig 16-1b). For example, they might have developed methods to culture a particular type of cell with which I had no experience, so that avenue of development was closed off. By 1995, a number of American, Canadian, and European laboratories, with more resources than those available to me, were using microfabricated surfaces (see Fig 16-1c). To survive among the sharks and rays, I had to devise experiments that were either novel or that entailed use of systems where the University of British Columbia (UBC) had a competitive edge. Eventually, as the questions became more sophisticated, the available resources at UBC proved insufficient, and the only way to compete was to forge collaborations with laboratories at other universities (such as Professors Textor and Spencer of the Laboratory for Surface Science Technology at the Swiss Federal Institute of Technology) to gain access to equipment and insights unavailable at UBC. The net effect of all this activity (both mine and my competitors') was that the coral reef of ignorance was slowly eaten away, and, as it eroded, new surfaces were exposed that initiated new research questions. The problem for little fish inhabiting some niches is Darwin's "survival of the fittest": Some investigators dropped out after a few contributions; others have survived for decades. Niches come in various forms: competence in technique, availability of specific patient populations, collaboration with industry or other university laboratories, or superior organizational arrangements.


Fig 16-1a Investigative niche for micropatterned and micromachined surfaces in 1983. UBC, University of British Columbia.
Fig 16-1b Investigative niche for micropatterned and micromachined surfaces in 1986. MRC, Medical Research Council (United Kingdom).
Fig 16-1c Investigative niche for micropatterned and micromachined surfaces in 1995. MIT, Massachusetts Institute of Technology; NYU, New York University; U of T, University of Toronto.



Demonstration versus discovery

Investigators vary in personality, from stamp collectors to wild-eyed dreamers, and different types of investigation attract different styles of investigators. The randomized controlled trial (RCT; also called randomized clinical trial and randomized control trial) has, at least potentially, the power to give firm actionable conclusions and holds a definite attraction for clinical investigators who want to improve current practice within their own working lifetimes. Systematic reviews of such trials can yield a gold standard for treatment, but such reviews often have limitations, because heterogeneity in patients, techniques, timing of observations, and choice of outcome variables may obviate combining studies. Some of these problems can be minimized by increasing study size and standardizing methods across studies, but increasing the scale of the study entails costs (eg, statisticians, elaborate software for managing patients, and sufficient support staff). The high cost results in such studies being undertaken only when the risk of failure is low, a situation that often exists when an investigator is making only incremental changes (eg, variations in the structure of a drug family that has already exhibited some success). Complex studies, such as multicenter trials, succeed because of the organizational and problem-solving abilities of the investigators, who must ensure that high-quality data are collected under well-defined conditions. But doing such studies is not necessarily a creative endeavor, because, in essence, the goal is demonstration: showing that a particular treatment, often previously tested in smaller-scale studies, has a high probability of success under a wider range of conditions. Discovery-based strategies typically occur on a smaller scale and could involve just a single investigator. The cost is low, but the risk of failing to come to a definitive conclusion may be high. The conclusion of the study might not widely affect clinical practice, because clinicians might be skeptical of trying treatments that have not had wide exposure. Nevertheless, this riskier approach is the one most likely to seed exciting new developments.

Focus

In writing grant proposals, it is common to speculate on the significance of the work, often with respect to a particular practical end. The purposes of scientific papers, which often center on just one specific aim of the proposal, are typically more limited. Experience has shown that research is most likely to be successful when it is purposeful, that is, when it is designed to yield information in discrete, concrete packages. Well-designed studies propose a significant question and answer it well. Writers of manuals on grantsmanship counsel grant writers to focus on specific aspects of problems, making sure that the problem is not overstated, contains concrete and attainable objectives, and concentrates on specific topics. In clinical research, a lack of focus often results when investigators try to answer too many questions with one piece of research: The available subjects become split into groups too small to provide statistical power for any one aspect of the investigation. However, calculations of statistical power require information on the size of the expected effect and the variability in the data, and often this information can be gathered only through pilot studies. In requesting funds for a project, applicants must demonstrate that the proposal will successfully produce interpretable results.
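To illustrate why the expected effect size and the variability of the data are indispensable inputs, the usual normal-approximation formula for the number of subjects per group in a two-group comparison of means can be coded in a few lines. A minimal sketch in Python follows; the probing-depth numbers are purely illustrative assumptions, not taken from any particular trial:

```python
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sample comparison of means,
    using the normal approximation:
    n = 2 * sigma^2 * (z_(1-alpha/2) + z_(1-beta))^2 / delta^2"""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return 2 * (sigma ** 2) * (z_alpha + z_beta) ** 2 / delta ** 2

# Illustrative numbers only: to detect a 0.5-mm mean difference when the
# standard deviation (eg, estimated from a pilot study) is 1.0 mm:
print(round(n_per_group(delta=0.5, sigma=1.0)))  # about 63 subjects per group
```

Halving the detectable difference quadruples the required sample size, which is why splitting subjects across too many questions so quickly destroys statistical power.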

Classification of research strategies and the certainty of their conclusions

The algorithm for classifying research designs is given in Fig 16-2.10 The important questions in determining the design include the following:

• Did the investigator assign the exposure?
• Were subjects randomly allocated into groups?
• Was there a comparison group?
• What was the direction (time sequence) of exposure and outcome (prospective or retrospective), or were they concurrent?

The strength of the designs is illustrated in Fig 16-3. The characteristics of the various approaches are briefly discussed below (see also Frey11), and more detail is given for some designs in subsequent chapters.

Non-experimental designs involve just one group of subjects (or, more broadly, observational units), be they people or animals, such as might occur in determining the prevalence of caries in a community.


Fig 16-2 Flowchart for the classification of research designs. (Reprinted from Grimes and Schulz10 with permission.)
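Because Fig 16-2 is itself a small decision algorithm, it can be expressed directly in code. Below is a minimal sketch in Python, with illustrative function and argument names:

```python
def classify_design(investigator_assigned_exposure, randomized=None,
                    comparison_group=None, direction=None):
    """Classify a research design following the Fig 16-2 flowchart.
    direction: 'exposure_to_outcome', 'outcome_to_exposure', or 'concurrent'."""
    if investigator_assigned_exposure:            # experimental study
        return ("randomized controlled trial" if randomized
                else "nonrandomized controlled trial")
    if not comparison_group:                      # observational, descriptive
        return "descriptive study"
    return {                                      # observational, analytical
        "exposure_to_outcome": "cohort study",
        "outcome_to_exposure": "case-control study",
        "concurrent": "cross-sectional study",
    }[direction]

# Example: exposure not assigned, comparison group present,
# subjects followed from exposure toward outcome
print(classify_design(False, comparison_group=True,
                      direction="exposure_to_outcome"))  # cohort study
```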

The simplest form would be purely descriptive and would involve only summarizing the data by categories. Alternatively, a relationship between measured variables (such as a weight-height relationship) may be quantified. Such studies do not provide evidence of cause and effect, but they may nevertheless provide clues. For example, Jenner12 noted a correlation between occupation (milkmaid) and decreased susceptibility to disease (smallpox), leading him to conduct more definitive studies that led to the use of vaccination.

Pre-experimental designs also usually involve one group, but more measurements are made over time to determine whether changes have occurred. This type of design is used in dental practices as dentists try to optimize their procedures. For example, a dentist may treat a group of patients using a new material in their restorations and then follow up with them to determine the time required before the restorations need to be replaced. A comparison would be made to the dentist's expectations of the lifetime of the material currently used, an informal type of historical control.

Alternatively, a practitioner might make a set of measurements on patients, then introduce a treatment and make subsequent measurements. For example, a dental hygienist might measure plaque scores on patients, instruct them in oral hygiene, and follow up on their progress to assess how knowledge of proper technique influenced their plaque scores. Such a study would not be definitive in establishing a cause-effect relationship, because other confounding variables might influence the outcome; for example, the patients might have changed their eating behavior because they knew that their plaque scores were being monitored over time.

Quasi-experimental designs involve a comparison group, but, typically, subjects are not assigned to the treated or control group randomly. On occasion, the lack of random assignment occurs because the formation of groups was outside the investigators' control (as can happen because of referral patterns or patient selection of treatment).


Fig 16-3 Evidence pyramid for investigative approaches and its support by inductive logic through analogy and authority. (From the base upward: bench studies and case series; case studies and animal models; case-control studies; cohort studies; randomized controlled trials; and systematic investigation. Strength of evidence increases toward the apex; analogy, expert opinion, and authority lend increased plausibility.)

Without random assignment, the two groups may well differ in a number of measured or unmeasured variables that might influence the results of the study.

True experimental designs include appropriate comparison groups, and subjects are assigned to the groups randomly, a procedure that enables investigators to assume that the groups are equal on unmeasured variables, which thus would not contribute to any effects observed as a result of the experimental manipulation.

A question of current debate is the certainty of the conclusions of various types of design. Early estimates, such as those attributed to Sackett,13 suggested that RCTs give around 90% certainty, whereas a cohort analytic study gives around 20%; a case-control study, about 12%; a before-after study, around 6%; and a descriptive study, about 2%. A landmark paper by Sacks et al14 later found that the percentage of trials in which the agent being tested was found to be effective depended on the design of the investigation.

Only 20% of papers using RCTs claimed the agent was effective, whereas almost 80% of historically controlled trials claimed effectiveness; bias in patient selection was posited as the prime cause of the difference. More recently, Concato et al15 have undertaken comparisons of the results of RCTs with well-designed observational studies (cohort and case-control) and argue the minority view that observational studies do not necessarily overestimate treatment effectiveness. Moreover, they observed that the observational studies were less prone to heterogeneity (defined as variability in point estimates of treatment effectiveness among studies) and suggest that observational studies are likely to include a broader representation of the at-risk population. Perhaps the situation is best summed up by Grossman and Mackenzie,16 who posit that studies should be evaluated by criteria appropriate to their particular applications, and not simply discarded or accepted according to a simplistic RCT/non-RCT dichotomy.


Fig 16-4 Subject selection and flow for a study on the effects of chlorhexidine-soaked dental floss funded by the Canadian Dental Hygiene Association Fund for Research and Education. (Flow: target population of patients with gingivitis being treated by Canadian dental hygienists [> 10,000,000]; recruited for consideration by a graduate student working in two Vancouver dental offices [67 inquiries, 57 screened]; failed to meet criteria [28]; met inclusion criteria [29]; did not agree to participate [2]; agreed to participate [27]; dropped out [1]; completed study [26].)

Study design: Sampling and time's arrow

Studies differ in how the sample is selected and in the relationship of the observations to time's arrow. Cohort studies have groups that are selected on the basis of exposure; a cohort is simply a group of individuals who share certain characteristics. The prospective cohort study goes forward in time: Groups selected on the basis of exposure to a risk factor are compared with a control group that is not exposed. Observations are made over a period of time, and an outcome of interest, be it disease incidence or the lifetime of a restoration, is determined. Comparison is made with a concurrent control group that may or may not be matched on various characteristics. The retrospective cohort study examines data that have already been collected: Groups are formed based on exposure at a given point in time and are followed from that time forward to the present.

Cross-sectional designs occur when all of the information relates to the point in time at which the data are collected. Intervention designs are prospective and involve assignment of the exposure. Case-control designs are sampled according to outcome: Persons with the disease or other outcome of interest are compared with people without it to determine which factors may have played a role in its development. Case-control designs are always retrospective or cross-sectional, because the outcome (ie, the case) is the basis of selection into groups. Surveys sample a defined population to determine the prevalence of a condition, as well as to assess associations, such as that between physical activity and weight. Because they cannot determine the sequence of the association (ie, whether the putative cause preceded the effect), surveys do not speak directly to cause-effect relationships, but they can provide clues.

Subject selection

Studies actually include only a few of the many people to whom the investigator hopes the results apply. The selection of subjects has the inevitable side effect of limiting, at least to some degree, the interpretation. Figure 16-4 presents the subject-selection process of a study conducted by my graduate student, Pauline Imai.17 Because the study was funded by the Canadian Dental Hygiene Association Fund for Research and Education, our patron presumably hoped the results would apply to the target population of patients with gingivitis treated by Canadian dental hygienists. Notice that each step in the selection introduces a possibility of bias. We worked in two practices in Vancouver, so the extent to which the study population represented all Canadian patients with gingivitis is unknown; considering the heterogeneity of the larger population, however, the study group likely was not highly representative. In addition, people who agree to participate in studies may well differ in a number of properties from the less altruistic masses. Small-scale studies like this are sometimes referred to as explanatory trials; their intent is to determine whether a treatment works under certain well-specified conditions. These studies stand in contradistinction to pragmatic trials, which seek to determine whether a treatment works under real-world conditions. Thus, a consequence of the selection procedure is that often the participants in a study are not broadly representative of the target population.


Because studies on the same topic conducted in different locations may experience different influences in selecting study participants, there may be markedly different types of subjects, and results may vary. Consider the difference between the likely participants in a study on guided tissue regeneration done in a private practice limited to periodontics and participants recruited from patients being treated at a dental school. The two sets of subjects would probably differ in social and economic status, knowledge of oral hygiene procedures, and motivation. For example, if the treatment were sensitive to the oral hygiene practices of the participants, there might be a difference in outcome, even if the intervention were applied equally well in both locations. Selection bias, which can affect both the external and internal validity of a study, is discussed in more detail later in this chapter.

A successful cohort study: The Dunedin study

Sometimes the problems of selection and dropouts can be solved if appropriate care and adequate resources are available. The Dunedin study birth cohort18 comprised all those born between 1 April 1972 and 31 March 1973, some 1,000 children, so that a condition with a prevalence of about 3% in the population would be represented by about 30 children in the sample. The children were recalled and assessed for a wide variety of health measures, including oral health, as well as lifestyles, behaviors, and attitudes, every 2 years up until age 15, then at ages 18, 21, 26, 32, and 38, with further assessments scheduled for ages 44 and 50. To reduce the number of dropouts, the Dunedin study team tracks all participants, wherever in the world they have moved, and funds their return to Dunedin for assessment. A remarkable 96% of all living study members participated in the age-38 assessment. The timespan of the study enables study of the natural history of disease in individuals,19 and, because the study incorporates three generations of the same families, inferences can be made about the influence of parenting practices on children. In the instance of oral health, it was found that the precise effects of the oral health status of one generation on that of the next within families are unclear, but people with poor oral health tend to have parents with poor oral health.20 A remarkable finding from the Dunedin study is that over a 40-year period, childhood self-control strongly predicted adult success, in people of high or low intelligence, be they rich or poor.21

The Concept of Validity

An important issue in research design is the validity of the study. The term validity is used somewhat differently when discussing a study than when discussing a measurement (as was done in chapters 12 and 13). For a study, a distinction is made between internal and external validity. Internal validity refers to the extent that the data support the hypothesis. A study is internally valid when conclusions about the hypothesis can be inferred from the data and there are no plausible alternative explanations. An investigator must understand the threats to internal validity.

Campbell and Stanley's threats to internal validity

Chapter 7, on inductive logic, emphasized that the logical criticism of inductive arguments often depends on constructing alternative hypotheses that can also explain the results. Proposing alternative hypotheses can require detailed specialized knowledge about the topic under investigation. But often in clinical research, certain factors that lead to alternative hypotheses occur so frequently that they merit special consideration. These factors (commonly called threats to validity) were first explicitly outlined for educational research by Campbell and Stanley.22 Campbell and Stanley also introduced a shorthand notation for describing common investigation strategies that is briefly outlined here, as it aids description of the research strategies to be described later. Typically, a study begins with identifying eligible subjects (or other experimental units, such as animals or cell cultures). These may then be exposed to the following operations:

R = Randomized. Randomization must be done by a specific procedure, such as use of a random-number table; it is not synonymous with haphazard allocation. The purpose of such random allocation is to ensure that the groups are comparable before any manipulations or treatments are performed.

O = Observed. In some cases, several observations will be made successively, and these are designated as O1, O2, O3, etc.

X = Treated. In some research strategies, a treatment will be applied to the group.

In this notation, for example, a randomized pretest-posttest controlled design would be written as R O1 X O2 for the treated group and R O3 O4 for the control group.


This list, although not exhaustive, covers several of the most important factors in interpreting research. The factors were originally identified from studies of education but have found wide applicability in other fields. Although each is discussed separately, interactions can occur. The list is presented here because it will enable us to consider how each of the research strategies may be affected by these factors. Appendix 5, abbreviated from Campbell and Stanley, lists the possible deficiencies in various experimental and quasi-experimental designs. But the possibility of a given threat to validity does not indicate that the threat is probable or that it constitutes a plausible explanation of the data. In naturalistic observation, for example, the effect of time is often a threat to validity, but it may not be a significant effect if the interval between observations is short or if the process itself is time invariant. Thus, in evaluating these threats to validity, the goal is not only to determine whether they might occur but also to gauge the magnitude of their influence.

History

Consider the following types of investigation. A given group, A (eg, the population of Toronto), is observed for some property, such as the prevalence of tooth decay. Conclusions might be reached about the group by itself; the data might indicate that caries prevalence decreased with time, or the data might be compared with another group, B, where observations were made at the same time or at different times. However, during the elapsed interval, many events can occur, and these can interfere with the study. For example, the analysis of the incidence of tooth decay for a given region would be complicated by factors that altered the amount of fluoride ingested. Another problem related to history is that often the investigator does not know exactly the factors to which the group was exposed, and these factors may differ between groups. This problem can be controlled by designing a longitudinal study in which the groups are selected in the present and examined in the future. In this prospective design, the groups will be exposed to the same historical factors concurrently as they age together. A methodologic problem with longitudinal studies is sample shrinkage. For example, in studying dental health in the elderly population, an investigator could lose many subjects from the sample, as they might die before the study is completed (see the discussion of mortality later in this chapter). The effects of history become particularly important in situations where the observations are made over an extended period.

One of the problems in the study of environmental carcinogenesis is that the time required for tumors to develop can be long, and, consequently, the subjects are exposed to many chemical compounds in the interim. In surgical research, Spitzer23 notes that one should always approach research dependent on historical controls with a great deal of constructive skepticism. Clinical science advances in general, and it would be easy to attribute improvement to a particular surgical (or dental) procedure when the benefit could be due to the array of accompanying interventions used in the total treatment of the patient.

Maturation

There are many changes associated with aging. A number of longitudinal observational studies, such as those that report the natural history of periodontal disease, have been designed to assess these changes. When experiments run for a long period, it is difficult to separate the effects of aging from the effects of treatment. Spontaneous remissions can be considered another effect of maturation, and maturation can account for the supposed efficacy of quack remedies. "The art of medicine," Voltaire tells us, "consists of amusing the patient while nature cures the disease." At first glance, it appears that maturation could be studied by simply observing individuals of various ages. In this approach, called a cross-sectional study, an investigator slices, as it were, the population into sections based on age and observes some property. The advantage of this method is that it enables an investigator to get all of the information at once and to obtain as large a sample as desired. The difficulty is that the effects of maturation are confounded with the effects of history. An individual who is now 60 years old not only would be 40 years older than an individual who is 20 but also would have lived his or her earlier years in an environment in which nutritional and other health-related practices differed from those experienced by the 20-year-old subject.

Reaction

When we measure something, we alter it. The alteration may be slight, but it can also be large. For example, gingival crevicular fluid flow is measured by placing paper strips in the crevice, but the mechanical irritation caused by the filter paper itself increases vascular permeability and fluid flow.24


Selection

When subjects are grouped, there is the chance that a selection process is operative that makes the experimental and control groups different even before the treatment is applied. For example, we might want to justify periodontal surgical therapy by designing a study to compare the periodontal health of a group of people who elected to undergo surgical treatment in a prominent periodontal practice with that of a group who decided against surgical treatment. After a number of years, the group that opted for treatment might enjoy better periodontal health. However, it is not difficult to imagine that the two groups would differ in many respects besides the fact that one group was treated; the individuals who chose treatment would likely be wealthier or more highly motivated with respect to dental hygiene. For such selection-biased studies, we must ask what caused the differences: the way the groups were selected, or the treatment itself?

Mortality

As a study progresses, subjects are often lost from the various comparison groups. This loss of subjects is particularly noticeable in some long-term studies in which the subjects are treated at dental schools, where it often seems difficult to get patients to return for reexamination or observation. The question that arises is: Are the patients who returned different from the ones who did not? The reasons the patients stayed away may be relevant to the experiment (eg, the treatment did not work, and they went elsewhere), or they may be trivial (eg, the patient moved and could not be located). Nevertheless, the experiment can be compromised if there is a large mortality of subjects in the study and/or the groups in the study are affected differently by mortality.

Instrumentation

A measuring instrument may change during the course of an experiment, so that the results obtained at the beginning of a study differ from those obtained at the end. While it is unlikely that the markings on a periodontal probe will change, the person using it may change his or her technique over time. This problem is generally handled by calibrating instruments and observers. More reliable measurements of student performance on restorative dental procedures can be obtained when the evaluators have undergone some sort of calibration procedure.


Regression

Have you ever wondered why instructors are reluctant to offer praise? They may be victims of statistical regression. We can think of a student's performance as being influenced by a number of factors, some of which are random. Sometimes, albeit rarely, all random factors combine in such a way that they produce a product that is better than average. Let us suppose that the instructor praises the student when the student demonstrates such excellence. However, on the next procedure this unlikely combination of factors will not occur; the product will be worse, and the instructor will conclude that praise has a bad effect on performance. Unfortunately, if the random factors all conspire against the student and he or she produces a bad product, the instructor will berate the student, and, in all probability, the student will do better the next time. The instructor will conclude that the tongue-lashing worked. The phenomenon at work is called statistical regression. It is most important in experiments where subjects are selected into groups on the basis of extreme scores on characteristics measured by an imprecise procedure (ie, one with a large standard deviation), because extreme measurements tend to regress toward the mean, regardless of any treatment. Dental students and their instructors are not the only victims of statistical regression; the phenomenon has been documented for flight instructors and their students. Tversky and Kahneman25 concluded the following:

"By regression alone, therefore, behavior is most likely to improve after punishment and most likely to deteriorate after reward. Consequently the human condition is such that, by chance alone, one is most often rewarded by punishing others and most often punished by rewarding them."

Regression toward the mean can be a potential problem in any study that entails repeat measurements. Gunsolley et al26 found that much of the perceived loss of attachment due to scaling at sites of minimal probing depth reported in many studies may be caused by the statistical phenomenon of regression toward the mean.
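Regression toward the mean is easy to demonstrate by simulation. In the following minimal sketch (Python; all numbers are illustrative assumptions), each observed score is a stable true ability plus measurement noise; students selected for extreme scores on a first occasion drift back toward the mean on a second occasion with no praise or punishment at all:

```python
import random

random.seed(1)
N = 10_000

# Observed score = stable true ability + random measurement noise
ability = [random.gauss(70, 5) for _ in range(N)]
first = [a + random.gauss(0, 10) for a in ability]
second = [a + random.gauss(0, 10) for a in ability]

# Select students with extreme scores on the first occasion
top = [i for i in range(N) if first[i] > 90]
bottom = [i for i in range(N) if first[i] < 50]

def mean(xs):
    return sum(xs) / len(xs)

print(f"top group:    first {mean([first[i] for i in top]):.1f} "
      f"-> second {mean([second[i] for i in top]):.1f}")
print(f"bottom group: first {mean([first[i] for i in bottom]):.1f} "
      f"-> second {mean([second[i] for i in bottom]):.1f}")
# With no praise or criticism applied, the top group falls and the
# bottom group rises toward the overall mean of 70.
```

The noisier the measurement relative to the spread of true ability, the stronger the apparent "improvement" or "deterioration" becomes.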

External validity

External validity refers to the extent to which findings of a particular study can be generalized to other conditions or populations. It should be noted that external validity is a secondary consideration; it is relevant only when a study has internal validity. If there are many ways of interpreting the data (ie, low internal validity), nothing is established, and there is no point in extrapolating the findings to other conditions. Evaluating external validity requires clinical—and not statistical—expertise. Checklists for the evaluation of papers often mention external validity only briefly and do not give specific procedures. One widely available checklist focuses on two questions involving locality and available evidence. With respect to locality, can the results be applied to the local population that the physician is treating? Clinicians are urged to consider whether the subjects covered in the study could be sufficiently different from the ones they treat to cause concern and whether the local setting differs from that of the study. The second question deals with the issue of whether the results of the study agree with other available evidence. The vague guidelines illustrate the current situation; although internal validity can be fairly rigorously evaluated, assessing external validity is much more subjective. Although difficult to fix, problems with external validity are often easy to spot, sometimes just by comparing the title of the study with the sample actually studied. Sechrest and Hannah27 note that every sample can be considered a convenience sample, because investigators can only study what is available to them. Nevertheless, some investigators work hard to obtain samples representative of the population to which they intend to refer their findings. It would seem that evaluating external validity requires detailed examination of the population or variables used in the study and of the patients of a particular practice to whom the treatment might be applied. Detailed information on both groups is seldom available, but, on occasion, large obvious differences are evident. We can easily imagine that a study on oral hygiene using dental students as subjects may not extrapolate well to the general population, who have less knowledge of oral hygiene and its importance to health. Another test of external validity is whether the relationship between the variables in the study agrees with other research on the topic. That is, if many studies in different settings agree, it seems likely that the effect is large enough to overcome local differences in population and facilities.

As noted earlier, selection criteria for inclusion into a study exert a major effect on external validity, because, strictly speaking, they effectively define and thereby constrict the target population. Nevertheless, it is possible that the findings of a study apply more widely; that is, they are not limited to the particular group specified by the inclusion and exclusion criteria, or to other specifics of the study (such as where it was done). The best answer to a criticism of external validity is to repeat the study in other contexts with different patient groups in other medical/dental centers. This approach is one rationale for the use of multicenter trials. Another approach is to make a case for the results applying more generally by arguing the similarity in relevant properties of the expanded target population and the original target population, specified by the inclusion and exclusion criteria. However, because there are so many known and unknown factors that could influence any given result, determining similarity between any two populations (such as a study population and the patients in any given practice) has the potential to be problematic and uncertain. After reviewing some of the issues involved in determining external validity of RCTs, Rothwell28 delivered a somewhat pessimistic report:

1. Differences exist between countries in the time after diagnosis to treatment, and very different treatment effects can be observed.

2. Eligibility criteria are often poorly reported even in large RCTs; moreover, sometimes subjects are selected after "run-in" periods, where poorly compliant subjects are excluded. On occasion, patients who have previously been demonstrated to respond to related drugs are selected, leading to an "enriched" subject population.

3. Characteristics of randomized patients who are recruited are found to differ from the general pool of those who are eligible, even in large trials. Recruitment of less than 10% of potentially eligible patients is common.

4. Trials may have protocols that differ from those in usual clinical practice. Consider the original implant studies of Brånemark, in which patients benefited from much more expert scrutiny than would be possible in a private practice.

5. Outcome measures used in clinical trials may be surrogates for the clinically relevant variables.

6. Adverse effects of treatment are poorly reported.

Fig 16-5 Clinical trials are more generalizable when they are large and involve simple treatments. (a) A larger sample benefits both the statistical generalizability (ie, more precise estimates of a population parameter) as well as the scientific generalizability (ie, applicability to another target population), because a large trial size will tend to select a more diverse population that may contain elements similar to other populations. (b) A simple treatment also benefits generalizability. This is illustrated by a treatment in which one procedure with a probability of success of 90% is required; success will occur 90% of the time. However, if three procedures are required, each with a probability of success of 90%, success will occur only about 73% (ie, 0.9³) of the time. If five procedures, each with a probability of success of 90%, are required, success will occur only 59% (ie, 0.9⁵) of the time. (Axes in the original figure: generalizability, low to high; trial size, small to large; treatment, simple to complex.)

Rothwell28 concluded that some trials have excellent external validity, but many do not, particularly some of those performed by the pharmaceutical industry. A more positive view was taken in an informal review29 that concluded that generalizing results from well-conducted trials to clinical practice can mostly be carried out with confidence, especially for simple therapies. More complex therapies require careful consideration. Based, in part, on studies reporting the outcome of complex surgeries, the authors suggested the relationship illustrated in Fig 16-5. Simply put, the larger a study’s sample size and the simpler the treatment, the more confident a reader becomes in extrapolating from the study to other populations. We can see how surgery, in particular, might be chosen as the example, because the success of surgical procedures relies to a large extent on the personal technical skills of the surgeon. For example, readers of the dental research literature may have no way of assessing the technical competence of the individuals performing the procedures and, because of the typical brevity of the materials and methods section, may have little idea of how difficult some procedures are to accomplish.
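The compound-probability arithmetic in the caption of Fig 16-5 can be checked directly; the sketch below simply evaluates 0.9 raised to the number of required steps (the 90% per-step success rate is the caption's illustrative assumption):

```python
# Overall success of a treatment requiring n independent steps,
# each with a 0.9 probability of success.
for n in (1, 3, 5):
    print(f"{n} step(s): {0.9 ** n:.3f}")   # 0.900, 0.729, 0.590
```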

Categories and Prevalence of Problems

Sechrest and Hannah27 reviewed 100 predominantly non-experimental studies (99%), including descriptive studies, relational studies (often for the purpose of predicting some outcome), and quasi-experimental studies (defined as two or more groups compared prospectively or retrospectively with respect to some characteristic, with the intention of inferring differences between the populations from which the groups were drawn). The problems were classified as the following:

• Sampling. Low response rate (< 25%), failure to describe the basic characteristics, and unacknowledged bias in the sample; 50% of the studies had sampling problems.

• Measurement. Failure to mention reliability and validity of measurement, low reliability coefficients, and probable bias due to recall ineffectiveness and self-report; 60% of the studies had measurement problems.

• External validity. Errors occurred in many instances when data collected on narrow or convenience bases were extrapolated to dissimilar populations; 58% of the studies had problems with external validity.

• Internal validity. Failure to rule out likely threats to internal validity, such as selection; 19% of the studies had problems with internal validity.

• Construct validity. Failure to provide a precise explanation of the concepts. Lack of precision in concept gives investigators too much latitude in the operationalization of manipulations and measures; 25% of the studies had problems with construct validity.

• Statistical problems. Liberally using multiple univariate tests without adjustment for inflation of the type I error rate (ie, failing to provide a Bonferroni adjustment; see the sketch after this list), failing to report intercorrelations among independent or dependent variables, stating results to be significantly different without appropriate tests, and attributing importance to statistically significant but relatively meaningless outcomes; 46% had problems with the validity of their statistical conclusions.

• Unjustified conclusions. Often causal conclusions, or claiming unjustified differences between groups and changes over time; 33% of the studies had unjustified conclusions.
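The inflation of the type I error rate named in the statistical problems item, and the Bonferroni remedy, can be illustrated in a few lines (a sketch with illustrative numbers, not a substitute for a proper multiplicity analysis):

```python
# With m independent tests at significance level alpha, the familywise
# chance of at least one false positive is 1 - (1 - alpha) ** m.
alpha, m = 0.05, 10

unadjusted_fwer = 1 - (1 - alpha) ** m           # ~0.40 with no adjustment
per_test_alpha = alpha / m                       # 0.005: Bonferroni threshold
adjusted_fwer = 1 - (1 - per_test_alpha) ** m    # ~0.049, near the nominal 0.05

print(unadjusted_fwer, per_test_alpha, adjusted_fwer)
```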

References

1. Dewey J. Reconstruction in Philosophy. Mineola, NY: Dover Publications, 2004:18–20.
2. Hetenyi G. Features of a successful grant application. Presented at the College of Medicine at the University of Saskatchewan, Saskatoon, Canada, 29 April 1991.
3. Reif-Lehrer L (ed). Writing a Successful Grant Application. Boston: Science Books International, 1982.
4. Dingle JT (ed). How to Obtain Biomedical Research Funding. New York: Elsevier, 1986.
5. Meador R. Guidelines for Preparing Proposals. Chelsea, MI: Lewis, 1986:47.
6. DeBakey L. The Scientific Journal: Editorial Policies and Practices: Guidelines for Editors, Reviewers, and Authors. St Louis: Mosby, 1976:1–3.
7. Mackenzie RS. Grant writing and review for dental faculty. J Dent Educ 1986;50:180.
8. DeBakey L, DeBakey S. The art of persuasion: Logic and language in proposal writing. Grants Magazine 1978;1:43.
9. Cuca JM. NIH grant applications for clinical research: Reasons for poor ratings or disapproval. Clin Res 1983;31:453–461.
10. Grimes DA, Schulz KF. An overview of clinical research: The lay of the land. Lancet 2002;359:57–61.
11. Frey B. Statistics Hacks. Sebastopol, CA: O'Reilly, 2006:33.
12. Jenner E. An Inquiry into the Causes and Effects of the Variolae Vaccinae, a Disease Discovered in Some of the Western Counties of England, Particularly Gloucestershire, and Known by the Name of the Cow Pox. London: DN Shury, 1801.
13. Cited in: Helewa A, Walker JM. Critical Evaluation of Research in Physical Rehabilitation. Philadelphia: Saunders, 2000:15.
14. Sacks H, Chalmers TC, Smith H Jr. Randomized versus historical controls for clinical trials. Am J Med 1982;72:233–240.
15. Concato J, Shah N, Horwitz RI. Randomized controlled trials, observational studies, and the hierarchy of research designs. N Engl J Med 2000;342:1887–1892.
16. Grossman J, Mackenzie FJ. The randomized controlled trial: Gold standard, or merely standard? Perspect Biol Med 2005;48:516–534.
17. Imai PH, Putnins EE, Brunette DM. The effects of flossing with a chlorhexidine solution on interproximal gingivitis: A randomized controlled trial. Can J Dent Hyg 2008;42:8–14.
18. The Dunedin Study. https://dunedinstudy.otago.ac.nz. Accessed 28 June 2019.
19. Thomson WM, Shearer DM, Broadbent JM, Foster Page LA, Poulton R. The natural history of periodontal attachment loss during the third and fourth decades of life. J Clin Periodontol 2013;40:672–680.
20. Shearer DM, Thomson WM, Caspi A, Moffitt TE, Broadbent JM, Poulton R. Family history and oral health: Findings from the Dunedin Family History Study. Community Dent Oral Epidemiol 2012;40:105–115.
21. Moffitt TE, Poulton R, Caspi A. Lifelong impact of early self-control. Am Sci 2013;101:352–359.
22. Campbell DT, Stanley JC. Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally, 1963:5–6.
23. Spitzer WO. Selected nonexperimental methods. In: Troidl H, Spitzer WO, McPeek B, et al (eds). Principles and Practice of Research: Strategies for Surgical Investigators. Berlin: Springer-Verlag, 1991:222.
24. Cimasoni G. The Crevicular Fluid. Monographs in Oral Science, No. 12. Basel: Karger, 1974:94–95.
25. Tversky A, Kahneman D. Judgment under uncertainty: Heuristics and biases. Science 1974;185:1124.
26. Gunsolley JC, Yeung GM, Butler JH, Waldrop TC. Is loss of attachment due to root planing and scaling in sites with minimal probing depths a statistical or real occurrence? J Periodontol 2001;72:349–353.
27. Sechrest L, Hannah M. The critical importance of nonexperimental data. In: Sechrest L, Perrin E, Bunker JP (eds). Research Methodology: Strengthening Causal Interpretations of Nonexperimental Data: Conference Proceedings. Rockville, MD: US Department of Health and Human Services, Public Health Service, Agency for Health Care Policy, 1990:1–7.
28. Rothwell PM. Factors that can affect the external validity of randomised controlled trials. PLoS Clin Trials 2006;1:e9.
29. Flather M, Delahunty N, Collinson J. Generalizing results of randomized trials to clinical practice: Reliability and cautions. Clin Trials 2006;3:508–512.

17 Observation

"It is the theory that decides what we can observe."
ALBERT EINSTEIN1

The strength of studies using naturalistic observation is that they apply directly to the real world. These studies are not burdened by the gap often found in experimentation in which the artificial environment of the laboratory differs from real field conditions. In naturalistic observation, no attempt is made to intervene in the course of nature, but considerable skill may be required to contrive conditions whereby observations can be made and data collected.

Observation-Description Strategy

According to Beveridge (albeit writing in 1950),2 more discoveries have arisen from intense observation of very limited material than from statistics applied to large groups. Darwin's On the Origin of Species is perhaps the greatest scientific work based almost wholly on observation. Observation is still a useful strategy. Although the effect of fluoride has been documented with sophisticated laboratory techniques and elaborate epidemiologic surveys, the initial findings were based on detailed and careful naturalistic observation.3 The Audubon Society's Christmas Bird Count continues the tradition; more than 50,000 members gather data on the locales of over a thousand species in the Western hemisphere. Between 1954 and 1989, some 100 papers in refereed journals used Christmas Bird Count data.4 Classical epidemiology represents a use of the observation-description strategy, for it is concerned with three major variables that describe the distribution of disease or condition: person, place, and time. Typical data from classical epidemiologic studies include data on the prevalence and incidence of disease:

Prevalence = (no. of people with the disease) / (no. of people at risk)

Incidence is determined over a specified period and is defined as follows:

Incidence = (no. of new cases in a fixed time period) / (no. of people at risk)
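Expressed as code, these definitions are one-line ratios; the counts below are hypothetical and serve only to show the arithmetic:

```python
# Prevalence: existing cases divided by the number of people at risk.
def prevalence(existing_cases: int, people_at_risk: int) -> float:
    return existing_cases / people_at_risk

# Incidence: new cases over a fixed time period (eg, one year)
# divided by the number of people at risk.
def incidence(new_cases: int, people_at_risk: int) -> float:
    return new_cases / people_at_risk

print(prevalence(150, 1_000))   # 0.15: 15% currently have the disease
print(incidence(30, 1_000))     # 0.03: 3% developed it during the period
```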

Fig 17-1 The photograph of the wood duck shown to the students.

Prevalence and incidence are linked by the following formula:

Prevalence = incidence × average duration (of disease)

This formula demonstrates that chronic diseases, because they last a long time, tend to produce high prevalence values.

Confirmation bias and the prepared mind

People are routinely afflicted with confirmation bias, which is defined as the tendency to search for, interpret, favor, and recall information in a way that confirms one's preexisting beliefs or hypotheses. As is demonstrated in the courts, where eyewitnesses often disagree, people do not necessarily observe or remember events accurately. An example I used to use in teaching dental students the topic of observation-description entailed displaying a slide (Fig 17-1) of a wood duck on the screen for several minutes as I described my adventures in getting the photo at the George C. Reifel Migratory Bird Sanctuary in Delta, British Columbia. I then turned off the slide and asked the students to write down a description of a wood duck. Observation is an active process, and as noted by Grinnell5 in The Scientific Attitude, there is a difference between the abstract ideas of things and the specific examples of things. Nowadays, graduate or summer students arrive in my laboratory with ideas of cells as exemplified by ideal pictures or schematics of cells from textbooks but do not transfer these well to the jumbled assemblage of material found in cell cultures observed with phase contrast microscopy, so they have to be taught what is a cell, what is debris, what is (heaven forbid) yeast contamination, and how to distinguish a rounded-up dying cell from a vigorously healthy round mitotic cell. They have to learn how to look. The dental students in my class did not know much about bird watching, but they did have a conception of an ideal bird. Most of their descriptions were very brief, but most had the very good idea of drawing a picture of a bird to refresh their memory and provide an exemplar that they could label. Three of their answers are given in Fig 17-2. One student showed her alleged wood duck with a bright red worm hanging out of its mouth. I fear that student was pulling the old professor's leg. But the students also made errors by attributing to the wood duck features of their concepts of an ideal bird, ie, they exhibited confirmation bias. Rather than a wide bill they pictured sharp beaks; rather than duck feet (which are three-toed with webbing) they showed just the toes. They seemed to have forgotten the old joke:

Q: Why do ducks have large webbed feet?
A: To stamp out burning forest fires.

Q: Why do elephants have large flat feet?
A: To stamp out burning ducks.

The photo did not have much in the way of information that could be used to estimate the duck's size, but in about a 50:50 mix, some identified the wood duck as a large duck, while others classed it as a small duck; thus the students provided information based on interpretation (as eyewitnesses sometimes do) rather than objective information. The students also varied widely in their descriptions of the colors, which were limited by their vocabulary. Not one of the hundreds of students who completed this assignment used the word "iridescent," which features in bird identification guides' descriptions of the wood duck. A second feature of observation is exemplified by my own experience in photographing ducks that day. I decided that I wanted to take a picture of the colorful wood duck on the water. But I found it difficult to do; other ducks were always in the way. Eventually, after I returned home, I happened to look at my Zippo wood duck lighter (Fig 17-3). This might be classified as a scientific illustration; that is, the artist is not always obligated to document an actual specific observation but rather is sometimes asked to show the typical features of the object being illustrated as an aid for identification or discussion. In this instance, the artist decided to picture the wood duck at home, as it were, and it turned out the wood duck's home life is reminiscent of the 1950s sitcom Leave It to Beaver, with the wood duck (playing the role of Ward Cleaver) proudly standing on a branch outside, and there, poking her head

Fig 17-2 (a to c) Examples of three students’ answers to the wood duck assignment.

coyly from the door, is his mate (playing the role of June Cleaver). As I looked at the illustration, I realized that the main reason I couldn't get the picture of the wood duck on the water was because his somewhat dowdy mate was always in the way. As an incompetent, amateur bird watcher, I had not prepared myself to look for the female wood duck, so I lost the opportunity to photograph a male-female wood duck couple. As Louis Pasteur famously observed, "Chance favors the prepared mind," and my mind was unprepared. Before leaving the topic of wood ducks, I should add that the home life of a wood duck is quite spectacular. One day after hatching, the wood duck chicks, at the call of their mother, jump from their nest high up in the trees to land and follow her, which is a true leap of faith, but they survive.6 Two processes are at work. On the one hand, a scientist has to be prepared to make observations and is most often operating on some sort of theory about what is expected to be observed, thus opening the door to confirmation bias. On the other hand, scientists have to have a prepared mind to take advantage of the possibilities that chance throws before them. These problems of objectivity, relationship to theory, confirmation bias, and dealing with uncertainty pertain at the highest levels of scientific activity and intersect with the problems of scientific behavior discussed in chapter 2. Keating7 provides a gripping first-person account of a race to understand the cosmic microwave background (CMB), which was believed to hold the key to the inflation thought to have occurred after the big bang. There was common agreement among cosmologists that those who succeeded in identifying a pattern

Fig 17-3 Zippo wood duck lighter.

would win a Nobel Prize. In brief, Keating was part of a team that thought they had detected the first direct evidence of inflation, that is, "of the very birth pangs of the universe," in the form of a polarization pattern of the CMB viewed from their observatory in Antarctica using the BICEP2 detector. However, Keating had doubts; he knew that no one is immune from confirmation bias and feared they were seeing what they hoped to see. Although they tortured their pattern every which way to detect possible flaws in their interpretation, they needed a different kind of data to confirm it. The BICEP2ers knew there was a potential problem with interstellar dust, and indeed the location of their observations in the patch of sky called the

Southern Hole was chosen because the best available models predicted a low level of dust there. To be certain there was no dust problem, the BICEP2ers needed data at twice the frequency of that observed by their detectors. A further problem was that there was competition in the form of a $1 billion space satellite, Planck, parked one million miles above the Earth, free from atmospheric contamination. The competing Planck people had such data but were unwilling to share. Moreover, to establish priority, the BICEP2ers felt they had to publish quickly before the Planck people might scoop them. "Publish," thought the BICEP2ers, "or else our Nobel dreams might perish." But publication through standard journal channels was unlikely to be rapid, because the only people who had the expertise to review their submission would be their competitors, and it was to the competitors' advantage to block or delay publication. In the end, the BICEP2ers decided to break scientific protocol and announce their findings prior to peer review in a press conference held at Harvard's Center for Astrophysics. The announcement received wide publicity and was even hailed as "one of the most important scientific discoveries of all time"; even Keating's mother's Mah Jong partners and his children's teachers were talking about the announcement. The BICEP2ers' elation, however, was short lived; the Planck people published their data several months later, and those data showed that the BICEP2ers' patterns were most likely caused by dust. In fact, the BICEP2 detector was just a very precise dust detector. The BICEP2/Planck story illustrates the problems that arise when scientists ignore the norm of communalism. Both groups treated their data as if it were their private property when in fact the people who paid for it were the taxpayers. The lack of cooperation between the groups resulted in great effort being wasted and misleading, subsequently retracted, information being released. Keating ironically titled his article in Nautilus "How My Nobel Dream Bit the Dust," and his book Losing the Nobel Prize: A Story of Cosmology, Ambition, and the Perils of Science's Highest Honor was published in 2018.

Operational considerations

The observation-description strategy requires that, in order to be of value, observations must be detailed, systematic, and recorded as soon as possible after the event. Thus, dentists use a dental chart and systematically record their observations on the teeth of interest on the day the observations are made. The observations should be objective; that is, they must reflect

reality rather than the observer's preconceptions or biases. During their undergraduate years, some dental students may suffer from requirement fulfillment bias: a condition in which the student looks for and finds dental problems whose remedy entails procedures that are required for the student to graduate. Because it is difficult to be completely objective, an approach to this problem is to have multiple observers make multiple observations. In this way, the observers can check each other's observations and agree on what occurred. In dental research, it is not uncommon to confirm observations either by repeated measurements made by the same clinician or by having different clinicians examine the same patient. Another approach, not often possible in clinical research, is to use naive observers, that is, individuals who, by virtue of their lack of training or interest in a topic, have no preconceived notions of what to expect. In some psychologic experiments (the ethics of which are debatable), observers or participants have been misled so that their behavior or observations will not be biased. Central to the observation-description strategy is the classification of observations, which imposes or perhaps invents the order or pattern of relationships between the classes. The ideal is for the categories not to overlap yet to accommodate all possible observations. Classification systems are driven by the theories used in their construction. For example, bacteria traditionally have been classified by their shape, nutritional requirements, and staining properties, as these properties enabled many bacteria to be classified unambiguously into families that made sense. With the advent of modern methods of DNA analysis, taxonomic relationships are now often determined by the homology of the organisms' DNA, a much more fundamental property, which has led to the discovery of unsuspected relationships between apparently disparate species. The new technology and theories have led to the reworking of some bacterial family trees.

Advantages of the observation-description strategy

Practical and ethical considerations may determine that naturalistic observation is the only feasible strategy. Our knowledge of the growth and development of humans is based largely on material obtained from spontaneous abortions or accidental deaths. Discovering this information by experiment would require the execution of individuals at various times after conception and would obviously be unacceptable on ethical grounds. Another

advantage of the observation-description strategy is that it can be inexpensive, because elaborate equipment to control the environment is not required. Thus, many people, such as amateur bird watchers, are able to pursue this approach to scientific investigation.

Weaknesses of the observation-description strategy

There are several weaknesses associated with a straight descriptive study:

1. Comparison is made between observations made at different times. Because the observations are not simultaneous, there are more opportunities for circumstances other than the conditions of interest to the investigator to be altered. If an investigator were making observations on dental disease in children in Canada during the past 40 years, comparisons would be clouded by factors such as the introduction or removal of fluoride in some communities, changes in ethnic composition, and any other social or economic changes that might affect dental health.

2. As discussed previously, many techniques of observation alter the observations themselves. For example, the preparation of tissues for microscopic examination involves a number of steps, and each step has the potential to produce some alteration of biologic structure. Similarly, psychologic observations are complicated by interactions between the interviewer and the subject. This problem, called observational reactivity, also occurs in experimentation. However, an investigator can include simultaneous controls in an experiment to lessen the effect of observational reactivity.

3. In contrast to experimentation, naturalistic observations can be made under only a limited number of conditions. For example, to assess the carcinogenicity of chemical compounds, laboratory animals are exposed to concentrations of chemicals that are much higher than any they would encounter in their natural environments. These high doses enable effects to be seen that would not be observed in nature.

4. The selection of observations is a complex problem involving both sampling and method. In some instances, it may be possible to get complete (100%) coverage of the population of interest, but, more commonly, a sample must be selected. In that event, the investigator must decide who will select

the sample, what tests or measures will be used as criteria for inclusion or exclusion of subjects or objects, and how a random selection of the sample can be obtained. The selection of the sample can be biased in various ways so that it does not represent the parent population. The selection of the observations to be made is crucial to the observation-description approach. There are so many possible things to observe that observations, by necessity, are restricted to those believed to be most relevant to the question of interest. The questions posed by investigators are influenced by their training, which, in turn, is influenced by the prevailing concepts accepted in their subject disciplines. This can be seen in periodontal research in the “bug-of-the-month club.” As research implicates different microorganisms in the pathogenesis of periodontal disease, subsequent investigations in the microbiology of periodontal disease often include the new putative pathogens among the microorganisms studied. The background of the research worker who asks the question will affect the type of observation that is made. Despite the pain-alleviating measures available to dentists, many patients still harbor anxiety about dental treatment. Investigation of the anxiety could be legitimately undertaken by investigators with different backgrounds, who would make different types of observations. A psychiatrist might investigate anxious patients’ attitudes to authority or their experiences with weaning and toilet training, whereas a behavioral psychologist might use a polygraph to measure physiologic parameters during dental visits. The psychiatrists and psychologists would interpret their findings in very different manners based on the prevailing theories of their respective disciplines. The point here is that their backgrounds determine what observations they make. The sophistication of an observation-description study can be judged by the refinement of the observational tools. Bird watchers can wander around, making random observations and disturbing the birds, or—in a more sophisticated study—they can hide in blinds, making observations at specific locations in a defined sequence. Moreover, the observation-description strategy can be driven by refinements in technology, because these make more detailed observations possible or enable previously inaccessible observations to be made. The Hubble Space Telescope can make observations that cannot be made on earth. New methods of the assessment of oral disease will probably make it possible to assess oral health in areas of the world where this was previously not feasible.

Table 17-1 | Criteria for judging quantitative research and qualitative research10

Quantitative research    Qualitative research
Internal validity        Credibility
External validity        Transferability
Reliability              Dependability
Objectivity              Confirmability

Characteristics of real-world research

Real-world research, sometimes called naturalistic inquiry, is the only possible approach to gain insight into the complex, messy, and poorly controlled situations that are nonetheless central to understanding the oral health (or lack thereof) of populations. Case study is "the strategy for doing research which involves an empirical investigation of a particular contemporary phenomenon within its real-life context using multiple sources of evidence."8 The case is studied in its own right and may yield results that do not necessarily generalize to a larger target population. Real-world research tends to emphasize solving problems and predicting effects, rather than finding causes and quantitative relationships and developing theories.5 Naturalistic inquiry has the following characteristics8,9:

• Data are collected in a natural setting, and humans are the primary data-gathering instruments.

• Investigators collect qualitative data, and collection can be sensitive, flexible, and adaptable. Data collection often employs open questions. This approach ensures that there is no need to push square observational pegs into round classification holes—as can happen with survey-based research.

• Naturalistic inquiry emphasizes full descriptions and interactions between the investigator and the respondents; analysis can be secondary.

• Use of tacit (intuitive or felt) knowledge is considered legitimate. The investigator's personal experiences and insights are part of the investigation.

• Sampling is often purposive rather than random, because purposive sampling allows the widest range of data to be collected, and it can be adapted as hypotheses emerge from the observations. Investigators accept that one cost of purposive sampling is that standard statistical methodologies used to generalize to target populations cannot be applied to the data. Instead, there is often a tentativeness in generalizing the data.

• Data are analyzed inductively. Specific responses and interactions are examined closely to discover important categories and relationships. • A grounded theory, in which the theory emerges from (ie, is grounded in) the data, may be used, rather than deductively determining whether some principle or law is being followed. • The focus of questions may change as more information becomes available during a study—a process sometimes called emergent design. • Negotiated outcomes are common, as credibility generally requires that the participants agree with the investigator’s interpretation; meanings are often explicitly negotiated with respondents. • Idiographic interpretation is necessary, as data are interpreted in terms of the particulars of the case, rather than conformance with laws. • Focus-determined boundaries are observed. • Special criteria for trustworthiness are devised that are appropriate to the form of the inquiry. Naturalistic inquiry often uses qualitative methods, but quantitative methods may also be involved. Although qualitative and quantitative data are often described as if there were an unbridgeable dichotomous divide between them, Trochim10 has argued that there is really little difference, as all qualitative data can be coded quantitatively, and all quantitative data are based on qualitative judgment. Qualitative data come from various sources, including in-depth interviews; direct observation analogous to that used in studying animal behavior, where one observes but does not question the subject; and analysis of documents or other cultural artifacts. Trochim10 suggests that criteria different from those used in quantitative research should be applied for evaluating qualitative research studies (Table 17-1). Credibility relates to the requirement of the researcher to provide sufficient information on the methods used and their justification. Moreover, as the objective of qualitative research is often to provide a description or understanding of phenomena from the subjects’ perspective, the subjects are in the best position to make this judgment and should be consulted. Transferability, according to Trochim, “refers to the degree to which results of qualitative research can be generalized or transferred to other contexts or settings.”10 As noted earlier, transferability requires methods that differ from those used in producing statistical generalizations. One approach is to describe the assumptions and contexts in detail so that others can determine how well they relate to other situations of interest. Another is “making the case,” where

persuasive techniques are used to argue that it is reasonable to generalize. Dependability, Trochim10 argues, emphasizes the need for the researcher to account for the context in which the research is done and how it can change findings. Although perfect replication of conditions is generally not possible, dependability is most directly demonstrated by doing another study in another setting and coming to similar conclusions. Confirmability relates to the question of whether other investigators can confirm the findings. In naturalistic inquiry, a wide variety of strategy types are used, including the following8–10: • Endogenous research: Conducted by “insiders” of a culture using their own epistemology and structure of relevance. • Participatory action research: Participants contribute to formulation of the research question, study design, and analysis. • Critical theory: An epistemology (ie, study of knowledge and how it is justified) that often involves concepts concerning social justice and social change. • Phenomenology: Uncovers meaning of how humans experience phenomena through descriptions of individual experiences. • Heuristic design: Immersion of an investigator into a problem and self-reflection of investigator’s personal experience. • Life history: Involves eliciting life experiences and how they are interpreted. • Ethnography: Description and interpretation of cultural patterns of groups (anthropology). • Grounded theory: Generation of theory by inductive process of constant comparison.

Analysis and interpretation of qualitative research

Qualitative research shares many of the characteristics of quantitative research; there has to be clear thinking to determine what is at issue, what other relevant studies bear on the problem, and what are the factors that threaten the validity of the conclusions. As in any logical processing of thought, there should be balance, some consideration of facts that do not fit the interpretation, and revision of explanations when unsupportive data are found. Chronology is important in suggesting possible cause-effect relationships, and various means, such as triangulation, are used to determine the reliability of the data. Greenhalgh and Taylor11 suggest

that a good qualitative study will address a problem through a clearly formulated question. Quoting Nicky Britten, they state that there is a real danger that "the flexibility of the iterative approach will slide into sloppiness as the researcher ceases to be clear about what it is (s)he is investigating." Qualitative data—like quantitative data—must be analyzed systematically, and there is a need to identify, locate, and explore examples that contradict the emergent hypothesis. As with quantitative methods, analysis of qualitative data should be done using explicit, systematic, and reproducible methods. Quality controls can be introduced into qualitative research. Researchers generally agree that the validity (closeness to the truth) of qualitative research benefits by having different researchers independently analyze the data (one form of triangulation). It would be of interest for qualitative investigators to determine interobserver agreement or disagreement, as this would test the assumption sometimes made that the interpretation of subject meanings is self-evident. However, quantitative data have a way of imposing their own discipline in demonstrating conclusions. In determining whether the difference between groups could be explained by chance, there is a set of statistical conventions that, more or less, must be obeyed to avoid censure. The fact that some of the assumptions in the analysis may not hold for the collected data can be disturbing, but the presentation of the data and the interpretation of differences tend to be standard. In contrast, authors of reports using qualitative data remind me in some ways of scientist-essayists such as Francis Bacon, Lewis Thomas, and Stephen Jay Gould. The author considers various interpretations of the data and argues for the most likely. The essayist has considerable latitude on the aspects that will be emphasized and the rhetorical techniques that will be brought into play. Among the more common approaches taken by writers of case studies are consideration of the characteristics of language (as qualitative data are words, and there is a focus on word choices), discovery of regularities in participants' views, close examination of the texts' meanings, and extensive reflection.8 The conclusions might or might not be related to a theoretical framework; sometimes pure description might be adequate. Logical argument is employed to demonstrate the plausibility of the conclusions. Relative to quantitative research, writing and rhetorical skills assume increased importance, because they may have to bear more of the load in the demonstration of qualitative research conclusions than they bear in quantitative research, where the data can sometimes speak for themselves.
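For readers who want to make the interobserver comparison mentioned above concrete, one common option (among several) is Cohen's kappa, which discounts the agreement expected by chance. The sketch below uses hypothetical codes and ratings:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    # Observed agreement minus chance agreement, rescaled so 1 = perfect.
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Chance agreement: product of each rater's marginal proportions.
    expected = sum((c1[c] / n) * (c2[c] / n) for c in set(c1) | set(c2))
    return (observed - expected) / (1 - expected)

r1 = ["comfort", "hygiene", "health", "comfort", "hygiene", "comfort"]
r2 = ["comfort", "hygiene", "comfort", "comfort", "hygiene", "health"]
print(round(cohens_kappa(r1, r2), 2))   # 0.45: well short of perfect agreement
```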

Table 17-2 | Comparison of qualitative and quantitative research methods in terms of purpose, function, and findings14

Approach                                  Qualitative      Quantitative
Purpose
  Exploratory                             Ideal            Suitable
  Descriptive                             Suitable         Ideal
Function
  Causal                                  Inappropriate    Ideal
  Understanding, depth                    Ideal            Inappropriate
  Hypothetical, interpretive              Ideal            Suitable
  Empirical, statistical                  Inappropriate    Ideal
  Prediction, accuracy, breadth           Inappropriate    Ideal
Findings
  Implications, problem identification    Ideal            Suitable
  Projections                             Inappropriate    Ideal

Perhaps the stickiest question faced by qualitative investigators is the issue of generalizability. On the one hand, qualitative researchers tend to argue for the importance of subjectivity and the particulars of the case. On the other hand, qualitative researchers—like quantitative researchers—need money to pursue their projects. Information that can be generalized to other settings is obviously more attractive to funding agencies than are pure descriptive studies. Thus, qualitative researchers claim that generalization is beside the point, but circumstances sometimes dictate that they must generalize anyway. Paley12 describes their dilemma as follows:

Like other researchers, they want to talk in generalizable terms about reality; they want to be objective; they want to do theory. But they are saddled with a philosophy that is disabling, because they can only talk about perceptions, and meanings and uniqueness.

Some naturalist observers are acutely aware of the difficulty in selling qualitative research to granting agencies that want a well-defined outcome. Sandelowski et al13 note that qualitative researchers employing emergent research design must "negotiate the paradox of planning what should not be planned in advance." Their advice for dealing with this paradox is not unlike that given to quantitative investigators:

Provide a rationale for your decisions. For example, when purposive sampling is employed, they advise researchers to identify the purpose of sampling, determine the first subject to be examined, describe focal and comparison subject groups, specify the rules governing how sampling decisions will be made, and clarify that no claim should be made that the selected sample can be used for statistical calculations that assume random probability sampling. Similarly, procedures should be specified for data collection, management, and verification.

The Value of Qualitative Methods in Dental Research

Increasingly, qualitative methods are being used in research, education, and policy related to oral health. Table 17-2, taken from Gift,14 contrasts the qualitative and quantitative research methods in terms of purpose, function, and findings. In general terms, qualitative research deals with the subjective world of individuals as they describe it, directly and indirectly. While quantitative research can assess outcomes such as decayed, missing, and filled teeth or periodontal status, qualitative research can give insight into why the data are what they are by exploring such aspects as an individual's motivations, life history, and perceptions. Qualitative approaches can help develop concepts, suggest hypotheses, and identify problems, which might be used in subsequent studies that could include quantitative measures. If researchers using qualitative measures are sometimes criticized for inappropriate generalizations, investigators using quantitative measures can be equally culpable of going beyond their numbers and forming conclusions about subjective perceptions of their subjects. Three examples of how qualitative methods can be used in dental research follow.

Problem identification

When I was 8 years old, I noticed that a friend of my grandmother, Mrs Winpenny, who worked in the local bakery, had beautiful white teeth. When I mentioned this to my mother, she replied, "Well, they're false, of course." She then told me that Mrs Winpenny had wanted white teeth all her life, and when she finally needed a complete set of dentures, she had fought her dentist tooth and nail to get dentures that, to adults, looked like a set of "Chiclets" but that, to a

child, looked wonderful. Later, as a young man, I recall being disappointed that my teeth were not any whiter after I had them cleaned at a dental office. Thus, I had both personal and interpersonal evidence that people preferred bright white teeth, even though the dental profession largely ignored their desires and promoted a more natural-looking esthetic. My qualitative approach to discovering this insight might be classified as “life history.” To broaden my base of inference, I could easily have adopted an “ethnographic” approach by looking at the social context of toothpaste advertisements and discovering the Pepsodent commercial dating back many years that assured consumers, “You’ll wonder where the yellow went, when you brush your teeth with Pepsodent.” To broaden my base of inference further, I might well have embarked on purposive sampling, by asking those friends whose teeth did not exactly sparkle, “Yo, yellow fang, do you wish your teeth were whiter after you visited the dentist?” Or perhaps more diplomatically and in the qualitative research tradition, I could have asked the more open-ended, neutral question, “What do you think about the teeth-cleaning services provided by your dentist?” A more commercially oriented mind might have taken this insight on unmet need that was developed by qualitative methods and worked on developing and marketing tooth-whitening products. Before taking out a mortgage on my home to embark on this career, it probably would have been wise to conduct a survey to determine quantitatively the prevalence of people who would pay for tooth whitening. Such action would have illustrated the principle that the insights arising from qualitative studies can lead to quantitative studies with appropriately focused questions. As it happened, I did not do any of these things, and this is one reason that today I am an impecunious penny-a-word author of a Quintessence book. Although the development of tooth-whitening systems has subsequently happened in spades, the opportunity was open for quite a long time. I would think that qualitative research methods could still be employed to discern consumer preferences that are not expressed, or at least not appreciated, by the dental profession.

Identifying perceptions that affect the delivery of oral health services

One example of qualitative research that produced interesting results, which could not have been found by quantitative methods, is the study by MacEntee et al.15 By asking older adults the open-ended question "What is the significance of oral health in the lives of older adults?" and by collecting and analyzing their unrestricted responses, the investigators identified three prominent themes: comfort, hygiene, and health. They found that older patients recognized the need to adapt but did not complain that oral health was a cause of social embarrassment. The subjects' beliefs about the need for treatment did not always agree with the recommendations of health professionals, and home remedies were used to explain why dental visits were unnecessary. Overall, it was clear that the subjects had a rather different view of the maintenance of their oral health than that typical of dentists. Indeed, the investigators concluded that most oral disorders can be managed so that the impact on quality of life is within the adaptive capacity of aging adults—a finding that recommends offering different approaches to oral health care delivery to this group than those currently being applied most commonly.

Supplementing and complementing quantitative approaches

Typically, qualitative research examines relatively few subjects, as the investigators must explore the words and perceptions of their subjects in detail. But it is clearly inappropriate to generalize from close observations of a few subjects to the general population. One method of avoiding this problem is to combine qualitative and quantitative approaches to a single problem. For example, Moore16 combined qualitative and quantitative approaches to the study of pain by triangulation, ie, a process of verifying the same phenomenon using different measures. Moore combined interview, observation, and focus group data with short surveys to gain insight into the perception of pain by different ethnic groups. In addition, there were ethnic differences in the use of local anesthetic for tooth drilling. Different conceptions of ethnic groups on the nature of pain led to differences in how the pain was managed, a finding that introduces a new consideration into chairside care.

Observation-Description Strategy in Clinical Interventions

The closest approach to the observation-description strategy in clinical investigation occurs in those instances where the investigator desires to provide

some documentation and publication of clinical experience. It is normally understood that, although such studies may not be definitive, they represent the starting point for explanations. This type of study occurs in a number of formats.

The case study

S   Tx   t   O1   O2   . . .   On

where S = pool of eligible subjects, Tx = treatment, and O = observations. In a case study, subjects are treated and carefully observed. Inferences are made on the basis of historical controls, that is, what would have been expected if treatment had not taken place or if a standard treatment had been used. Historical controls are generally regarded as less than ideal, because time and history can play important roles, such as the incremental improvements in dental equipment, materials, and techniques, which would be expected to improve outcomes. The case study is not an example of the observation-description study in the purest sense, because a treatment is applied. However, in some instances, the treatment is traditional or unproven and could be dealt with as observation on an individual encountering natural hazards. The case study could be considered a quasi-experimental design, but it is not a true experiment, because the subject (the patient) is not selected randomly and there is no control group. It resembles an experiment in that a treatment is applied and the subjects are observed closely. Regardless of its exact location in the investigational design taxonomy, the case study is worth discussing here. The case study is almost universally reviled in books on experiment design primarily because, without a control group, it is impossible to tell what would have occurred without treatment. This defect cannot be remedied by the investigator collecting an increasing number of successful outcomes. Streiner et al17 noted that case reports involving a total of about 1,500 patients indicated that gastric freezing would cure gastric ulcers. Subsequent properly controlled clinical trials demonstrated that the procedure was useless. Despite these shortcomings, one could argue that a truly dramatic treatment effect, were it present, would likely be observed. Moreover, it must be admitted that this approach is most often used in everyday problem solving, and to a certain extent, under some conditions,

it works. It works best when the effects of the treatment are large and easily observed and occur quickly. Under these conditions, some of the major problems of the design, such as the effects of time and subjective assessment, are less likely to confuse the results. To do a case study is obviously better than refusing to investigate at all, because there is at least some chance that new information will be gained, and it does encourage close observation. This design is most effective if the sequence of events in the absence of treatment is absolutely predictable. Observations from case studies have led to true experimental studies being carried out, and it is generally true that it is better to light one candle than to curse the darkness. Nevertheless, Huth18 believes that only three types of case report merit publication:

1. The unique case. Such a case occurs when a patient exhibits disease manifestations so extraordinary that they cannot be accounted for by known diseases or syndromes.

2. The case of unexpected association. When two relatively uncommon diseases are found in one patient, it is possible that their association indicates some kind of causal relationship. There may be some underlying mechanism—for example, a deficiency in the immune system—that explains both.

3. The case of unexpected events. An unexpected event may provide a clue to new information. A drug might appear to cause some unexpected benefit or some previously unsuspected adverse effect. If the author wishes to claim a causal relationship, the author must exclude plausible alternative explanations.

Improvements of the case study design by multiple observations

A significant problem of the case study design is that the patient might have improved without treatment. It would be worthwhile for the investigator to determine that the conditions would remain the same in the absence of treatment. For conditions in which there is no urgency for treatment, stability may be established by measuring the baseline; that is, by making multiple observations before treatment and then multiple observations after the treatment. The following is a shorthand description for this design:

S   O1   t   O2   O3   Tx   O4   t   O5   O6

The pattern of observations O1, O2, O3 is compared with the pattern of O4, O5, O6. If the baseline O1, O2, O3 is reasonably stable or has a constant slope, it may be possible to see the changes after the treatment and to be more confident that these changes were brought about by treatment, although other explanations would still be possible, because other effects such as maturation and history are not controlled in this design. For treatments that are reversible, a further refinement is possible: discontinue the treatment, allow sufficient time for the baseline values to be established again, and then reapply the treatment, as follows:

S   O1   t   O2   O3   Tx   O4   t   O5   O6   Recovery time   O7   O8   O9   Tx   O10   O11   O12

O12

The basic advantage of this design is that the effect of the treatment can be replicated. Moreover, if the effect occurs at different times and at different stages as the subjects age, the effects of history and maturation are probably not important. Further refinements in this family of n = 1 (ie, single subject) designs are discussed by Hersen and Barlow.19 They review tactics such as having multiple baselines (ie, establishing baselines for more than one variable with only some of the variables likely to respond to the treatment) and multiple treatments that might affect different variables. Despite these refinements, it is not possible to generalize from the case study because of individual variation, and there is no way of knowing whether the subject of the study is typical of those who have the condition. The case series is one attempt to improve generalizability.

Case-series analysis Similar to the case report is the case-series analysis, a retrospective study of case records usually gathered in one institution or practice. Investigators then make some generalizations on the basis of their cases. The case-series analysis suffers from some of the same failings as the case report. In the first place, it is often difficult to ascertain how a patient came to be treated at a given institution or practice, but it is highly likely that the cases in the study are not selected randomly. Another problem is that with the passage of time, patients are exposed to many different treatments or conditions, and it can be difficult to determine which

References

1. Einstein A. Cited in: Brush SG. Should the history of science be rated X? Science 1974;183:1164.
2. Beveridge WIB. The Art of Scientific Investigation. New York: Random House Vintage Books, 1950:140.
3. Cawson RA, Stocker IP. The early history of fluorides as anti-caries agents. Br Dent J 1984;157:403–404.
4. Pennisi E. Audubon count now serves science too. Scientist 1989;11:1–4.
5. Grinnell F. The Scientific Attitude, ed 2. New York: Guilford Press, 1992:6–18.
6. National Geographic. Animal Mothers: Wood Duck Paratroopers. https://video.nationalgeographic.com/video/duck_leavingnest. Accessed 17 May 2018.
7. Keating B. How my Nobel dream bit the dust. My team thought we’d proved cosmological inflation. We were wrong. nautil.us/issue/59/connections/how-my-nobel-dream-bit-the-dust. Accessed 17 May 2018.
8. Robson C. Real World Research. Oxford: Blackwell, 1993.
9. Tuckman BW. Conducting Educational Research, ed 4. Fort Worth, TX: Harcourt Brace, 1994.
10. Trochim WMK. The Research Methods Knowledge Base, ed 2. Cincinnati: Atomic Dog, 2001.
11. Greenhalgh T, Taylor R. Papers that go beyond the numbers (qualitative research). BMJ 1997;315:740–743.
12. Paley J. Phenomenology as rhetoric. Nurs Inq 2005;12:106–116.
13. Sandelowski M, Davis DH, Harris BG. Artful design: Writing the proposal for research in the naturalist paradigm. Res Nurs Health 1989;12:77–84.
14. Gift HC. Values of selected qualitative methods for research, education, and policy. J Dent Educ 1996;60:703–708.
15. MacEntee MI, Hole R, Stolar E. The significance of the mouth in old age. Soc Sci Med 1997;45:1449–1458.
16. Moore R. Combining qualitative and quantitative research approaches in understanding pain. J Dent Educ 1996;60:709–715.
17. Streiner DC, Norman GR, Blum HM. PDQ Epidemiology. Toronto: Decker, 1989:viii.



18. Huth EJ. How to Write and Publish Papers in the Medical Sciences. Philadelphia: ISI Press, 1982:58.
19. Hersen M, Barlow DH. Single Case Experimental Designs: Strategies for Studying Behavior Change. New York: Pergamon, 1976.
20. Mainland D. Elementary Medical Statistics. Philadelphia: Saunders, 1963:25.
21. Dalziel K, Round A, Stein K, Garside R, Castelnuovo E, Payne L. Do the findings of case series studies vary significantly according to methodological characteristics? Health Technol Assess 2005;9:1–146.
22. Agnew NM, Pyke SW. The Science Game. Englewood Cliffs, NJ: Prentice Hall, 1978:69.


18 Correlation and Association

Association is a statistically discernible relationship between two or more measured variables (or counts). Correlation is the term used when the relationship between the variables is linear. Correlation does not imply (in the sense of being sufficient to prove) causation. For example, smoking is associated with oral cancer and with tooth staining. Consequently, through their common cause, smoking, there would likely be a statistical association between tooth staining and oral cancer. However, if the stain were removed from the teeth, the risk of oral cancer would still be increased if the individual, now with pearly white teeth, continued smoking. If the individual stopped smoking, however, his or her risk of oral cancer would be reduced. According to Edward Tufte,2 the shortest true statement about correlations and causation is that “Empirically observed covariation is a necessary but not sufficient condition for causality,” or, expressed more succinctly, “Correlation is not causation but it sure is a hint.” Even sophisticated methods of exploring relationships, such as multiple regression and multilevel modeling, may be effective for prediction but may still be misinterpreted for causal inference. The method used to calculate association of exposure to a risk factor and disease depends on the type of data and the study design. The investigative strategies outlined in this chapter include the following: (1) cross-sectional survey, (2) ecologic study, (3) case-control design, and (4) follow-up (cohort) design. More detail on the issues and calculations involved can be found in Brunette3 and the standard statistical texts referenced in chapter 10. In addition, appendix 8 presents some of the standard terms commonly used in discussing association and correlation.



“The scientific literature is a large graveyard for correlations that didn’t ‘pan out’ when more data became available.”
PETER A. LARKIN1



Table 18-1 | Correlational cross-sectional study measuring Streptococcus mutans biotype, DMFT, and caries prevalence

| S mutans biotype group | No. of subjects | DMFT mean | DMFT SE | Caries-free (No.) | Caries-free (%) |
| c | 64 | 3.16 | 0.37 | 17 | 26.6 |
| d | 22 | 2.36 | 0.39 | 5 | 22.7 |
| e | 14 | 3.93 | 0.88 | 1 | 7.1 |
| Single (c or d or e) | 100 | 3.09 | 0.28 | 23 | 23.0 |
| Multiple (eg, cd, ce) | 96 | 4.76 | 0.40 | 11 | 11.5 |
| e-type carriers | 56 | 4.91 | 0.48 | 4 | 7.1 |
| Non-e-type carriers | 140 | 3.51 | 0.28 | 30 | 21.4 |

DMFT, decayed, missing, or filled teeth; SE, standard error. Data from Keene et al.4

Cross-sectional Survey

S → OP1 OP2 . . . OPn → r

where S = pool of eligible subjects, OP1 = observation of property 1, OP2 = observation of property 2, OPn = observation of property n, and r = calculation of correlation coefficient or other measure of association.

As an example of a correlational cross-sectional study in dental research, consider the data from Keene et al in Table 18-1.4 The correlational strategy is used to relate two or more variables. Investigators classify the study subjects into groups and examine each subject individually to measure the different variables. In this example, investigators measured three properties: the Streptococcus mutans biotype; decayed, missing, or filled teeth (DMFT); and caries-free status. Often in cross-sectional studies, one variable is concerned with exposure (S mutans), while the other variables relate to “caseness” (ie, presence of some conditions). Each group member is then assigned a value, depending on the particular variable measured. The degree of relationship between the variables is then assessed with statistical techniques. Notice that because the Keene et al study is a correlational study, no attempt is made to manipulate the biotype of S mutans in the naval men. The biotype is simply measured. Hence, a correlational study is similar to the observation-description strategy and shares its strengths and weaknesses. Two advantages of a simple correlational study (such as that of Keene et al4) are that (1) it is relatively inexpensive because no follow-up is required, and (2) subjects are not exposed to potentially harmful agents or conditions.

Statistical considerations and calculations

Association studies have statistical considerations. The groups formed from exposure and outcome, such as caries-free individuals exposed to e-biotype S mutans, could end up with very different sample sizes, and statistical efficiency could be poor. Various calculations, such as the odds ratio (OR), can demonstrate the presence or absence of a relationship between the two variables under study (in our example, caries and S mutans biotype). Some calculations may be more appropriate than others. The criticism of statistical tests used in correlation involves methods outside the scope of this book; detailed information is presented elsewhere.5–8 However, two simple statistical approaches are given here: the OR and the contingency table analysis using the Cohen kappa.

Categorical data: The 2 × 2 table

The association between exposure and disease is assessed with two general questions:

1. Is there a real effect of the exposure on disease, or could the data be explained adequately by chance?
2. What is the size of the effect?

In the simplest cases, these basic questions can be answered with a 2 × 2 table (see Table 18-3). The count data in the table are categorical (nominal) scale data that sort variables according to category, such as presence or absence of the e-biotype or presence or absence of caries. One test for determining relationships with such categorical data is the chi-square analysis, designated by the Greek letter χ. Chi-square tests whether


the actual observed counts differ significantly from what could be expected by chance. The table itself is called a contingency table (Table 18-3). The marginal totals of the cells (designated a, b, c, d), which are the characteristics of the whole sample, are used to calculate the expected values in each cell of the contingency table. The chi-square (χ2) statistic is then calculated by summing, over the four cells of the table, the squared difference between the observed and expected values divided by the expected value, ie, χ2 = Σ[(observed – expected)2/expected]. This calculation was done in chapter 9 for these data and yielded a χ2 value of 4.7, greater than the critical χ2 value of 3.84 (for an alpha level of 0.05 and 1 degree of freedom), so we can conclude that there is a statistically significant association between possession of the e-type biotype of S mutans and caries.

Table 18-3 | Contingency table using data from Table 18-1

| Factor | Caries present | Caries absent | Marginal row total |
| e carriers [+] | a = 52 | b = 4 | a + b = 56 |
| Non–e carriers [–] | c = 110 | d = 30 | c + d = 140 |
| Marginal column total | a + c = 162 | b + d = 34 | a + b + c + d = 196 |

Odds ratio

Once it is known that the effect is real, the size of the effect can be determined. A useful approach for analyzing these kinds of data from case-control studies is the odds ratio (OR), calculated as the ratio between the odds of experiencing the disease after exposure to a risk factor and the odds of experiencing the disease with no exposure. The OR is a widely used measure that can show the relationship between factors and disease. Odds are defined as the ratio of the probability that something is so, or will occur, to the probability that it is not so, or will not occur. In looking at the odds, we could express the data in Table 18-1 in a different way (Table 18-2).

Table 18-2 | Odds for caries using data from Table 18-1

| Factor | Caries present | Caries absent | Odds |
| e-type carriers [+] | 52 | 4 | 13 |
| Non-e-type carriers [–] | 110 | 30 | 3.7 |

For e-type carriers, the odds of having caries are 52/4 = 13. For non-e-type carriers, the odds are 110/30 = 3.7. The OR—the ratio of one odds to another—is 13/3.7 ≈ 3.5. An OR of 1 means there is no association between the factor and the disease. Here, the OR is greater than 1, indicating that there is an association between having the S mutans e-type and the disease. However, the estimate is imprecise: the 95% confidence interval for the OR is 1.19 to 10.59 (calculated using the MedCalc calculator available at www.medcalc.org/calc/odds_ratio.php). Note that the interval does not include 1.0 and that, as expected from our calculation above, the MedCalc calculation reported the significance level as P = .02 (ie, < .05).
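These figures can be checked by script. The following minimal Python sketch is our illustration (the study itself involved no such code); numerically, the quoted χ2 of 4.7 corresponds to the statistic with the Yates continuity correction, and the quoted P = .02 comes from a z test on the log odds. The use of scipy is our assumption about tooling.

```python
import math
from scipy.stats import chi2_contingency, norm

# Counts from Table 18-3: rows = e-type/non-e-type carriers,
# columns = caries present/absent
a, b, c, d = 52, 4, 110, 30

# Chi-square with Yates continuity correction (the scipy default for
# 2 x 2 tables); this reproduces the 4.7 quoted in the text
chi2, p, dof, expected = chi2_contingency([[a, b], [c, d]])
print(f"chi-square = {chi2:.1f}, df = {dof}, P = {p:.3f}")    # 4.7, 1, 0.030

# Odds ratio with a 95% confidence interval on the log-odds scale
odds_ratio = (a * d) / (b * c)                  # (52 x 30)/(4 x 110) = 3.55
se = math.sqrt(1/a + 1/b + 1/c + 1/d)           # SE of ln(OR)
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)
print(f"OR = {odds_ratio:.2f}, 95% CI {lo:.2f} to {hi:.2f}")  # 1.19 to 10.59

# z test on ln(OR), the test behind the reported P = .02
z = math.log(odds_ratio) / se
print(f"P for OR = {2 * norm.sf(z):.3f}")                     # 0.023
```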

The Cohen kappa

In the view of Norman and Streiner,9 the best measure of association for a contingency table with an equal number of rows and columns is the Cohen kappa. It is defined as the following:

κ = (observed agreement – chance agreement) / (1 – chance agreement)

As noted earlier, the data on caries presence or absence in relation to e-biotype can be extracted from Table 18-1 in the form of a contingency table (Table 18-3) (see also chapter 10). The Cohen kappa concentrates on the diagonal cells. That is, the upper left cell, a (positive for caries and positive for e-type), would be expected to have a high frequency if there were a strong association between e-type and caries, as would the lower right cell, d (absence of e-type and absence of disease). If there were a perfect association, all the cases would be found only in those two quadrants. However, in Keene et al’s data,4 the sum of quadrant a (52) with quadrant


d (30) divided by the grand total (196) is 0.42 of the sample; this is the observed agreement for the association. If there were no association between e-type and caries, the individuals would be distributed into the four cells by chance, and the number in each cell could be determined by using the marginal totals as follows:

Quadrant a would contain (56 × 162)/196 = 46.3 cases
Quadrant d would contain (140 × 34)/196 = 24.3 cases

The proportion of cases expected to be distributed into the a and d quadrants by chance is thus:

(no. expected in the 2 cells by chance)/(total no. in sample) = (46.3 + 24.3)/196 = 0.36

Therefore, for these data:

κ = (0.42 – 0.36)/(1 – 0.36) = 0.09

For a perfect association, κ = 1, and for no association, κ = 0. Thus, the calculated association by Cohen kappa, namely 0.09, between e-type and the presence of caries is weak.
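The same arithmetic can be scripted directly from the four cell counts; a minimal sketch (ours, for illustration only):

```python
# Cohen kappa for the 2 x 2 table of Table 18-3, reproducing the
# arithmetic shown above
a, b, c, d = 52, 4, 110, 30
n = a + b + c + d                                         # 196

observed = (a + d) / n                                    # 0.42
# Chance agreement for the two diagonal cells, from the marginal totals
chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # 0.36
kappa = (observed - chance) / (1 - chance)
print(f"kappa = {kappa:.2f}")                             # 0.09
```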

Interpretation of correlation

Authors can use correlation to predict future events. In our example, if the next Saudi naval recruit we saw was a carrier of the e-biotype S mutans, we might predict on the basis of the data in Table 18-1 that the recruit’s DMFT index might be higher than that of a non–e carrier. This is a legitimate use of the data. A second, and less conservative, use of the data would be to assert that the e-biotype S mutans is more effective than other serotypes in causing caries. The data are compatible with such a proposal, but the authors carefully avoid asserting it. Instead, they state, “it will be most interesting to determine whether these relationships can be demonstrated under carefully controlled laboratory conditions in the animal model.”4 The reason for their caution is the fallacy of concomitant variation, which is the fallacy of assuming that Mill’s principle of concomitant variation is necessarily true. While it is possible that two events showing a high incidence of correlation are causally connected, it is not always the case. For example, Huff10 states that there is a close relationship between the salaries of Presbyterian ministers in Massachusetts and the price of rum

in Havana. However, there is no reason to suspect that the one influences the other. Thus, when applying the principle of concomitant variation, we must remember Hume’s second criterion for causation11 (see chapter 8): There must be some plausible reason explaining cause and effect. Another problem with correlational studies is the linked or confounding variable (or confounder). Sometimes, when varying one condition, an investigator also varies another factor, either knowingly or unknowingly. The hoary example used to illustrate this problem relates the story of a philosopher who started out drinking scotch and soda at a party. When the host ran out of scotch, the philosopher switched to rye and soda, and finally to bourbon and soda. Carefully selecting the best correlation, the philosopher concluded that soda water was the cause of his intoxication. This example illustrates the common problem of a confounder producing an apparent association (drunkenness and soda) where none actually exists, but confounders can also diminish, reverse, or exaggerate an apparent association. The effect of any confounder can be important only if its association with the effect of interest is strong. The most common strategy used by epidemiologists to eliminate confounders begins with consideration of possible confounding variables that might be associated with the independent variable of interest. A number of variables that are often relevant to epidemiologic studies would be investigated, including sex, parity, ethnic group, religion, marital status, social class, education, occupation, rural or urban residence, and geographic mobility. The investigator then regroups the data to see if the possible confounder has any effect on the association. A study on the prevalence of root caries for individuals living in an extended-care facility might show that caries was more prevalent in women than men. However, subsequent analysis might demonstrate that the association was modified by the sample containing more elderly women than men (because women live longer) and that prevalence of root caries was related to age, not sex. Investigators generally accept that the best way to elucidate causation is to vary experiment factors separately, and scientists are skeptical when surveys are substituted for experiments. In a survey, investigators try to define their groups precisely, but it is always possible that two variables will be confounded and that an investigator will attribute the effect to the wrong cause. In the example of the drunken philosopher, the linked variables were alcohol (the hidden variable) and soda.


A second difficulty arises because all the data are collected at a single point in time. This leads to problems in interpreting cause-effect relationships, because—as David Hume11 noted—one of the criteria often used for establishing cause-effect relationships is that the cause precedes the effect (see chapter 8). In cross-sectional studies, interpretation becomes a chicken-and-egg problem. Gehlbach12 has used the example of findings from cross-sectional studies that demonstrate that children who are overweight are less active than their normal counterparts. The conclusion drawn from the studies is that children who have a low level of activity are more likely to become obese. However, it could be equally inferred that obese children have difficulty getting around and are inactive because they are obese rather than obese because they are inactive. Gehlbach12 notes that the cross-sectional design is efficient and flexible and has been increasingly used in medical research. A common problem is sample selection (eg, study populations drawn from hospitals sometimes bear little resemblance to the community at large). Despite its apparent simplicity and wide use, many people find OR difficult to comprehend and confuse OR with relative risk. When the outcome has a low probability, this confusion is not problematic because the values are similar, but for high-probability outcomes (eg, ≈ 55%) the OR is a poor measure of relative risk.13 Finally, the crude estimate of OR based on the 2 × 2 table does not take into account the influence of other risk factors; however, the OR may be adjusted for the presence of risk factors by use of logistic regression to produce an adjusted OR. The OR can be used in clinical decision making. In a case-control study in which the cases experienced dental implant failure, Yip et al14 found an adjusted OR of 2.69 (adjustment to be discussed later) for middle-aged women who had been treated with oral bisphosphonates. They concluded that their results supported a recommendation for discontinuation of oral bisphosphonates in long-term users to allow for recovery of bone remodeling. In addition to the cross-sectional study, a number of study designs use the basic correlational strategy but differ in the way the data are collected over time or the way subjects are selected. A brief description of the merits and deficiencies of the more common designs is given in later sections of this chapter; the interested reader is referred to Gehlbach12 for more detail.
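The divergence between the OR and the relative risk for common outcomes is easy to see in a small worked example (the counts below are hypothetical and ours, not drawn from any study cited here):

```python
# Hypothetical 2 x 2 counts showing when the OR approximates relative risk
def or_and_rr(a, b, c, d):
    """a, b = exposed with/without outcome; c, d = unexposed with/without."""
    odds_ratio = (a / b) / (c / d)
    relative_risk = (a / (a + b)) / (c / (c + d))
    return odds_ratio, relative_risk

# Rare outcome (2% vs 1%): OR (~2.02) and RR (2.0) nearly coincide
print(or_and_rr(2, 98, 1, 99))

# Common outcome (55% vs 35%): OR (~2.27) overstates the RR (~1.57)
print(or_and_rr(55, 45, 35, 65))
```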

Correlation versus regression

If the relationship between two variables is such that, when one variable (the independent variable) changes, the dependent variable changes linearly (as can be demonstrated with continuous data), the variables are said to be correlated. Figure 18-1 shows the relationship between the height and the weight of students in a graduate class. In this instance, the height is plotted on the x-axis, where the independent variable is usually plotted. This is the variable selected or controlled by the investigator; the term independent in this usage does not connote statistical independence. Weight is plotted on the y-axis, which is where the response (also called outcome or dependent) variable is typically plotted. The graph reveals a roughly linear relationship.

Regression

A main purpose of using correlational data is to make predictions, so it would be desirable to have a mathematic model to describe the relationship. One approach is simple linear regression, which represents the relationship as follows:

y = β0 + β1x + ε

where y is the predicted value of the response variable (weight), and x is the height (predictor or explanatory variable). β0 (a constant, in particular the intercept: the value where the regression line intersects with the y-axis) and β1 (regression coefficient, ie, slope, change in y per unit change in x) are population parameters. ε is the error term that records the deviation of an observed data point from the value predicted by the regression equation. For this example, the regression line has parameters β0 = –144.43 and β1 = 1.235. One advantage of having a mathematic model of this type is that it makes it possible to predict the expected value of y for any value of x. Such prediction is the bread-and-butter use of regression. In fact, typically regression analysis takes the values of x as stated (as if there were no measurement error) and minimizes the sums of squares on the vertical component (ie, the y-axis).
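As a concrete illustration of fitting such a line, the sketch below uses invented height-weight pairs (the raw data behind Fig 18-1 are not reproduced in this chapter, so the fitted numbers will not match the figure):

```python
import numpy as np
from scipy import stats

# Invented height (cm) and weight (kg) pairs standing in for the class data
height = np.array([155, 160, 165, 170, 175, 180, 185, 190])
weight = np.array([48.0, 55.0, 60.0, 65.0, 72.0, 78.0, 83.0, 90.0])

# Ordinary least squares minimizes the squared vertical deviations,
# treating the x values as measured without error
fit = stats.linregress(height, weight)
print(f"slope beta1 = {fit.slope:.3f} kg per cm")
print(f"intercept beta0 = {fit.intercept:.1f} kg")
print(f"r^2 = {fit.rvalue ** 2:.2f}")      # variation explained by height

# Prediction, the bread-and-butter use of regression
x = 172
print(f"predicted weight at {x} cm: {fit.intercept + fit.slope * x:.1f} kg")
```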

Correlation coefficient

In contrast, the correlation coefficient measures the degree of linear correlation between x and y, considered as equal partners.6 For continuous data, such as blood pressure, dollars, temperature, height, and weight, the degree to which two variables are related is given by the correlation coefficient (r), which can take values between –1 and 1. A value of r = 1 indicates a perfect


Fig 18-1 Simple linear regression of weight on height for a sample of graduate students. The residual, that is, the deviation of an observed point from the value predicted by the linear equation, is shown for two points. The equation is y = (slope × x) + intercept, where y is the weight (kg) and x is the height (cm). In this case, the slope is 1.235 and the intercept is –144.430; thus, y = 1.235x – 144.430. The points for women tend to fall below those predicted by the model, suggesting that sex might be considered as a predictive variable in a multiple regression model. The coefficient of determination, R2 = 0.696, indicates that almost 70% of the variation in weight among the sample can be explained by height. (Reprinted from Brunette3 with permission.)

positive relationship, and a value of r = –1 represents a perfect negative relationship. A value of r = 0 indicates that there is a lack of a linear relationship. Colton15 provides a very crude rule of thumb: r values between 0 and 0.25 indicate little or no relationship, 0.25 to 0.50 a fair degree of relationship, 0.50 to 0.75 a moderate to good relationship, and those above 0.75 a good to excellent relationship. The formula for calculating the sample correlation coefficient follows:

r = (n Σ xiyi – Σ xi Σ yi) / √[(n Σ xi2 – (Σ xi)2)(n Σ yi2 – (Σ yi)2)]

where the sums run from i = 1 to n, n is the total number of samples, xi (x1, x2, . . . xn) are the x values, and yi are the y values. The formula can be presented in several different forms by rearranging the terms. Correlation coefficient calculators are available online (eg, www.alcula.com/calculators/statistics/correlation-coefficient). Typically, researchers test whether there is a significant relationship between the two variables by demonstrating that the value of r significantly differs from

0. While a correlation coefficient always takes on a value between –1 and 1, a regression coefficient can take on any value.
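A short sketch confirming that the formula and a library routine agree, and showing the test of whether r differs from 0 (the paired values are invented for illustration):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])

# Direct evaluation of the formula given above
n = len(x)
r_manual = (n * np.sum(x * y) - x.sum() * y.sum()) / np.sqrt(
    (n * np.sum(x**2) - x.sum()**2) * (n * np.sum(y**2) - y.sum()**2))

# Library equivalent; the P value tests whether r differs from 0
r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f} (manual: {r_manual:.3f}), P = {p:.4f}")
```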

Residuals

Also plotted in Fig 18-1 are the residuals for two observations. The residuals can be thought of as the error between the observation and the value predicted by the model (ie, the least squares fitted line). The pattern of residual values is used to diagnose the assumptions of the linear regression model, including the normality assumption on the errors. Regression analysis also can be used to explain the data with regard to the amount of variability in the data that is attributable to the independent variable. An interesting property of the correlation coefficient r is that r2 (called the coefficient of determination) represents the amount of variation in the outcome variable that is explained by the predictor variable. For the height-weight relationship shown in Fig 18-1, r2 = 0.7, so 30% of the variation is unexplained, and therefore it might be prudent to investigate the data in greater detail.


In the height-weight data, the correlation coefficient differs significantly from 0. In this sample, taller people tend to weigh more. However, the values for women are more likely to fall below the regression line than are those for men. Moreover, when the data are considered separately for men and women, the regression lines for the two sexes have different slopes. This situation, in which more than one independent (or explanatory) variable has an effect on the dependent (response or outcome) variable, is common and can complicate the process of forming conclusions. A number of approaches can be used to control for this effect. One is restriction, that is, limiting the subjects in a study to a homogenous group; for example, one could limit the height/weight study to a single sex, male or female. Another is stratification, in which the investigator excludes subjects who fall outside a certain range of a variable, for example, studying only those who are underweight or overweight. Such measures may limit complications, but they also affect the external validity if the intent is to generalize the findings to the general population.

Confounding variables

Another problem that can obscure the relationship between a disease and exposure is confounding. A confounding variable (confounder) is defined as a risk factor for the disease that is associated with the exposure (the variable of interest in a study) and thereby contributes to the observed association between the exposure and the disease. Smith and Philips16 concluded that it is likely that many of the associations identified in epidemiologic studies are due to confounding, often by factors that are difficult to measure. Confounding is often offered as an alternative explanation for an association between an exposure and a disease. There are several criteria for determining a confounding variable: (1) There should be an association of the confounder and the disease among the unexposed individuals; (2) there is a difference in the distribution of the confounding variable among the different exposure groups; and (3) the confounding variable should not lie on the causal pathway between the exposure and the disease. Confounding results from imbalances in risk factors for the outcome (ie, disease) in different exposure groups. Confounding muddies the association of a risk factor with a disease and can either increase or diminish the effect of the exposure variable. Confounding is often detected by comparing the discrepancy between the crude and adjusted estimates

for relative risk. If adjusting for a potential confounder changes the relative risk, the variable is suspected to be a real confounder.

Adjustment

In epidemiologic studies, the possibility that confounding could account for observed associations is managed by demonstrating that the association is independent of the confounding factor.16 An exposure is said to be independently associated with the outcome if the association remains after the levels of other exposures are held fixed; this process is called controlling, adjusting, or conditioning for other exposures. Two major approaches to adjustment are (1) multiple linear regression, which is used when the outcome variable is continuous, such as blood pressure, and (2) logistic regression, when the outcome variable is binary, such as presence or absence of disease or an event such as tooth loss, death, or heart attack. Multiple linear regression is a statistical technique for predicting the value of a continuous and normally distributed dependent variable (also described as response or outcome variables) from a combination of two or more independent variables (also described as risk factors or explanatory or predictor variables, depending on the purpose of the study). For example, multiple linear regression might be used to examine the relationship of an Oral Health Index value (response variable) to age, race, education, and sex (predictor variables). Similar to simple linear regression, the goodness of fit may be measured by the coefficient of determination, R2, where R (note R is uppercase and is called the multiple correlation coefficient) in this case is a measure of correlation between the outcome variable and the joint set of independent variables. Meanwhile, each of the independent variables has its own regression coefficient; in our example, we used β1 for height. We might use β2 for daily caloric intake, β3 for another variable thought to affect weight, and so forth. A problem occurs with the units: the coefficients refer to change per unit of the independent variable, and these units differ from variable to variable. The solution is to standardize the coefficients by converting each variable to have a mean of 0 and standard deviation of 1; the regression is then done against these standardized scores for each variable. The resulting coefficients are called standardized regression coefficients or beta weights. Multiple regression can be set up so that the model includes covariates. The covariates are variables that the investigator thinks may be important and thus


may want to control, such as socioeconomic variables. The reason for including such variables in the analysis may not be to determine their separate effects, which could already have been established, but to enable the investigator to determine the effects of other independent variables by adjusting for the covariates. Multiple regression can also be used to examine interactions among the independent variables, for example, to determine if the effect of a person’s sex on Oral Hygiene Index depends on age. Still another problem is that often one wants to determine whether a particular independent variable really adds much information to the performance of the regression model. The solution to that question is to run the analysis with and without the independent variable and note if it has added in a meaningful way to the proportion of variance explained. This is often done in a stepwise manner where the predictor (ie, independent) variables are introduced one at a time, and their effect on the multiple correlation is determined. Still another problem now rears its ugly head. How do you determine the order in which you introduce the predictor variables? A number of strategies are employed, such as order of the variables in terms of variance explained or a predetermined order specified by the investigator who may have a theory or other rationale (hierarchical regression), but here we are entering territory that requires specialized knowledge, and for most readers of this book, the preferred approach would be to get the advice of a statistician. Logistic regression is the technique used to explore questions involving binary outcomes (yes or no), such as the relationship of chronic periodontitis to heart attack. Like multiple regression, logistic regression can include consideration of multiple independent variables and yields the regression coefficients for the predictor (explanatory or independent) variables. The results are interpreted as the probability that the outcome variable will occur in response to a variable of interest, after adjustment for the effect of the other (independent) variables. The results are usually presented as an OR. Logistic regression enables the researcher to compare two conditions, such as presence and absence of chronic periodontitis (CP), adjusted for other explanatory variables, on a binary outcome, such as erectile dysfunction (ED), and to obtain an adjusted OR. In the instance of CP and ED, the authors of a study investigating the relationship used logistic regression to find an adjusted OR of 3.29 (95% confidence interval of 1.36 to 9.55),17 a result that was somewhat different but included in the range of the value estimated from the simple 2 × 2 table analysis. The result was different

because the more sophisticated logistic regression analysis adjusts for potential confounders, including age, body mass index, household income, and education level. The focus of logistic regression lies in calculating ORs. When newspaper articles use phrases such as, “When adjusted for socioeconomic variables, people with periodontal disease are three times as likely to suffer from erectile dysfunction,” odds are that the underlying research used logistic regression.
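A sketch of how an adjusted OR of this kind is obtained in practice, using simulated data and the statsmodels library (the variable names, effect sizes, and tooling are our assumptions for illustration; the CP/ED study’s actual data and model are not reproduced here):

```python
import numpy as np
import statsmodels.api as sm

# Simulate a binary exposure whose prevalence depends on a confounder
# (age), and a binary outcome that depends on both
rng = np.random.default_rng(7)
n = 500
age = rng.normal(50, 10, n)
exposure = rng.binomial(1, np.clip(0.3 + 0.004 * (age - 50), 0.05, 0.95))
logit = -2 + 1.0 * exposure + 0.05 * (age - 50)
outcome = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Logistic regression of the outcome on exposure, adjusting for age
X = sm.add_constant(np.column_stack([exposure, age]))
fit = sm.Logit(outcome, X).fit(disp=False)

# exp(coefficient) for the exposure term is the age-adjusted odds ratio
adjusted_or = np.exp(fit.params[1])
lo, hi = np.exp(fit.conf_int()[1])
print(f"adjusted OR = {adjusted_or:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```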

Limitations of statistical adjustment

Statistical adjustment has limitations. Gilbert18 stated that if the data are observational rather than experimental, the greatest caution must be used when multiple regression is used in an attempt to identify those independent (explanatory) variables (x’s) that determine the response value (y). Indeed, in his view, observational data can never guarantee that any observed relationship is cause and effect. Several problems are connected with statistical adjustment:

1. Errors of measurement of x tend to dilute the true size of the functional relationship (but means of dealing with measurement error, such as regression calibration and simulation extrapolation, are being developed).19
2. All the important independent variables (x’s) must be included in the analysis. If a missing but important x is correlated with some of the other independent variables, those independent variables may well appear more important than they actually are.
3. The assumption that the effects of the independent variables are linear and additive must hold.

Important problems related to calculation and interpretation of association/correlation

Misclassification and exposure measurement error

In general, it is easier to detect meaningful associations when the measurement of exposures and diagnosis of disease are accurate, but sometimes the methods used to measure the exposure or to classify the disease are unreliable. Misclassification is related to the sensitivity and specificity of the classification method. Janket et al20 found that there was a significant underestimation of the


relative risk for coronary heart disease in studies that used self-reported periodontal status. Dietrich and Garcia21 suggested that, if a real effect exists, the association between periodontal disease and coronary heart disease may be stronger than that suggested by currently available cohort studies. One approach to deciding whether the criteria adopted for classification have unduly influenced the conclusion is to conduct sensitivity analysis, in which the researcher monitors the changes in the estimate of risk if classification criteria are varied.

Bias

As used in epidemiology, bias is an error in design or execution of a study that produces results that are consistently distorted in one direction because of nonrandom factors. In contrast, chance (random) distortions tend to cancel each other out. Even when studies are carried out with rigorous methodology, P values and confidence intervals are largely concerned with determination of sampling error, not the problem of bias, which can greatly impact the results and interpretation. The P value is the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis (no difference) is true. Values of P < .05 indicate that the groups differ significantly, that is, that the observed difference is unlikely to be explained by chance alone. Sackett22 listed 56 biases for clinical studies; however, because biases can arise from subtle sources, the actual number of possible biases to be considered is probably limited only by the imagination of the investigator. An important skill for epidemiologic investigators is the ability to identify conditions that lead to bias of whatever type in their studies. A common problem is that those individuals who are available for selection into a sample do not represent the target population (those people for whom the findings of the study are meant to generalize). For example, patients receiving treatment in a clinic set in an undergraduate dental faculty are unlikely to represent the general population. Some study designs are associated with particular types of bias. For example, study designs that require participants to recall exposure to particular agents are subject to recall bias, because people who have a disease are more likely to have thought more deeply about their past types of exposure than have healthy controls. Bias can also appear as a result of changing conditions over time. As methods of cancer detection improve, patients may be diagnosed at an earlier stage

of their disease, when there is a greater proportion of slow-growing tumors. Thus, survival time after diagnosis may improve irrespective of changes in methods of treatment because of both the greater lead time resulting from earlier diagnosis and the greater proportion of patients with slow-growing tumors. Finally, bias can also arise because of the complex interrelationships of variables that can occur in causal pathways.

Sample size

Regression analysis is an example where problems can arise because the sample size is too large or too small. A large sample size can lead to independent variables having an effect that is statistically significant but not biologically or clinically significant, whereas too small a sample size may result in impressively large correlations that cannot be replicated in other studies. Norman and Streiner23 recommend a rule of thumb whereby the sample size or number of data points is at least five times the number of independent (ie, predictor) variables. As noted earlier, a large number of independent variables tends to favor fishing expeditions with many possible comparisons, and the investigator, not accounting for the number of comparisons being made, may be deluded into accepting as significant one that was the result of the action of chance alone.

Ecologic Study

Ecologic studies attempt to establish the relationship between exposure to a factor (eg, fluoride in the drinking water) and an outcome variable (eg, caries) by using aggregate data.

P(1) → %E (and not E) %OV (and not OV)
P(2) → %E (and not E) %OV (and not OV)

where P(1) = population in a geographic or social unit (country, province, hospital) no. 1; P(2) = population in a geographic or social unit (country, province, hospital) no. 2; %E = percentage of subjects exposed (eg, to water fluoridation); and %OV = percentage having a given outcome variable (eg, dental caries).

The groups are often defined geographically (eg, Toronto, Vancouver), and the investigator does not know the relationship between exposure and outcome variable on an individual-by-individual basis. The


major advantage of the ecologic study is that it is inexpensive; the data are usually available.24 A significant disadvantage is that, although we know how many people are exposed and how many have the outcome, we do not know how many of the exposed people have the outcome. It is possible that unexposed people exhibit the outcome—as is illustrated in Fig 18-2. The proportion of black cats and the proportion of hatted cats is given for four populations (see Fig 18-2a); but blackness and hattedness are not necessarily linked in individuals. The ecologic design is very susceptible to the problem of the hidden variable. Correlations based on individuals are almost always lower than ecologic correlations. In our pictorial example of black and hatted cats, Cohen’s kappa equals 0.33, even though the correlation coefficient is a perfect 1.0 (see Fig 18-2b). Widely used in cancer research, ecologic studies can be constructive in forming hypotheses that could prove useful for directing future research.25

Fig 18-2 Example of an ecologic study showing that well-correlated variables are not necessarily linked. (a) Percentages of hatted cats and black cats in four groups. (b) Graphic representation of data showing a perfect correlation coefficient of 1. (Adapted from Rosen et al.25)
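The cat example can be mimicked with a few lines of simulation. In the sketch below (our hypothetical numbers, patterned after Fig 18-2; the original individual-level data are not given), the group-level percentages correlate perfectly while the individual-level association is close to zero:

```python
import numpy as np

# Four groups in which the percentage of hatted cats equals the
# percentage of black cats, but within each group the two traits are
# assigned to individuals independently
rng = np.random.default_rng(3)
pcts = [20, 40, 40, 60]
hatted, black = [], []
for pct in pcts:
    members = np.arange(100) < pct           # pct% of 100 cats
    hatted.append(rng.permutation(members))  # which cats wear hats
    black.append(rng.permutation(members))   # which cats are black
hatted, black = np.concatenate(hatted), np.concatenate(black)

# Group (ecologic) level: the two percentages are identical, so r = 1
print(np.corrcoef(pcts, pcts)[0, 1])         # 1.0

# Individual level: the association is weak (roughly 0.1 here)
print(np.corrcoef(hatted, black)[0, 1])
```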

Case-Control Design

In the case-control design (Fig 18-3), investigators first examine people who already have a certain condition (the cases) and look for characteristics they share, such as exposure to an agent in the past. A valuable feature of the case-control approach is that investigators

can test a variety of possible causes that led to the disease or condition of interest. It is quite a flexible and inexpensive means of exploring possibilities and particularly powerful when searching for the causes of a new problem. Case-control design is also practical for studying conditions or diseases that take a long time to appear; lung cancer is a classic example. An investigator could compare lung cancer incidence in a group of nonsmokers with a group of smokers, but it would likely take a long time before any difference is evident. Case-control–like approaches are informally used in problem-solving situations both in everyday life and in the laboratory. For a problem with contamination of cultures, an investigator might examine the history of the cultures to search for the factors that the contaminated ones held in common, such as batch of media, type of culture dish, source of antibiotic, technician who initiated the cultures, and other factors. In fact, the investigator would systematically go through all the components of the culture system. The characteristics chosen for examination as possible causes for a disease or condition will probably be influenced by the theories currently being used to explain the condition. Patients with juvenile periodontitis would be examined for the presence of a particular type of bacteria or for unusual characteristics of their leukocytes. If the cases share certain characteristics (eg, if the juvenile periodontitis patients have defective leukocytes), then an explanation of


Fig 18-3 Case-control design. In this highly efficient design, the study begins after the event (ie, the “causes”) are known. The factors (A, B) for which the cases are matched may vary from the specific to the general, such as being treated in the same hospital. Many possible causal conditions (risk factors) can be assessed.

the condition is suggested. The problem is that many people who do not have the condition might also share the same characteristics; a comparison group is needed. In the case-control design, the comparison group (the controls) is often selected so that it resembles the cases as closely as is possible. When we consider the many ways that people can be matched—by age, sex, race, religion, dietary habits, human leukocyte antigen (HLA) type, and other traits—it becomes clear that it is difficult to get a perfect match. Without a perfect match, there is always an alternative explanation. In Fig 18-3, the cases and controls have been matched for two factors, A and B. So, for example, A could be sex, and a in the diagram could stand for male. B could be age, and the value b in the diagram could stand for age 60 to 65 years. The controls would also be 60- to 65-year-old males. An investigator can overmatch. If the cases and controls are matched by age, sex, race, or any other factor, the matched factors can no longer be evaluated as etiologic agents, because they will be equalized in the cases and the controls. Sometimes case-control studies are not done with specific matching factors; there might be a sample selected from all the people who did not have the condition. Two additional problems with the case-control design relate to information bias, that is, shortcomings in the way the information is obtained or processed. The ideal is for there to be no major differences in quality or availability of the data between the cases

and the controls. This is difficult because of biased recall. People who have unpleasant conditions may recall the past quite differently from nondiseased individuals; sick people have the tendency to think deeply about what caused their problem.12 Information bias is not necessarily detrimental to interpretation. Sometimes investigators can devise approaches that can—if not overcome it—possibly determine its relative unimportance. Elwood26 examines an early and important study on the relationship of smoking and lung cancer27 to illustrate how bias can be assessed. As the data indicated that smoking was related to lung cancer, the problem might be that the cancer patients (who may well have wondered what caused their condition) might overreport their smoking, or, alternatively, the controls might underreport their smoking. The investigators examined the records of subjects who were suspected of having lung cancer at the time of the interview but whose final diagnosis was something else. These non-cancer patients, who at the time of interview thought they had cancer, reported smoking patterns similar to the non-cancer control patients, indicating that reporting bias may not have been an important issue. This example teaches us that while some investigative designs may be susceptible to certain problems, creative investigators can often find ways to assess just how much the problems are affecting their conclusions and can work out ways to minimize them.


Second, case-control studies often utilize data that were collected in the past under uncertain conditions. It is difficult to get reliable bias-free information for such factors as exposure to asbestos, smoking habits, fluoride ingestion, and therapeutic drug usage. Moreover, in any given study, some of the most important data may not have been collected, because it would clearly be difficult to assess in advance what information will be of interest in the future. The collected information may be of questionable reliability, because standardized techniques for collecting the data may not have been practiced. This point is sometimes overlooked. Some years ago, our clinic director (at the time) asked the faculty council for permission to destroy some outdated records. In the ensuing debate, a number of faculty argued that this action would amount to destroying a potential treasure trove of data that could be used for research. However, considering how the data were obtained, it is improbable that much really useful data would be lost. The students responsible for the treatments had different diagnostic, technical, and record-keeping skills. They had never been calibrated to any standard, and it is likely that the standards changed, because there was considerable faculty turnover during the period. To date, those debated but probably useless records have not been used for research purposes. Gehlbach12 recommends that a reader ask three questions when considering the controls in a case-control study:

1. What sort of population do the control subjects represent (ie, do they behave like people in the general population)?
2. Are there likely to be relationships between the control population and the factors under study that would influence the results?
3. Was matching used appropriately?

As in the other correlational designs, there is still the problem of the hidden variable. For example, exposure to some agent may be correlated with some other factor that is correlated with the outcome (eg, socioeconomic status). Despite these difficulties, there are merits in the case-control design. In comparison with other, more rigorous methods, case-control design is very efficient. For example, one way that a dentist could study juvenile periodontitis would be to characterize patients completely for a number of properties and follow these patients over time (ie, a follow-up design). The dentist could then compare the patients who developed

juvenile periodontitis with those who did not. There would be no problem with obtaining accurate data, because the data collection could be completely under the dentist’s control. There would be little problem with selection of a control group, because it would comprise the patients who did not get the disease. However, there would be a problem in the number of cases of juvenile periodontitis; some dentists never encounter a single case. The dentist might end up with all controls and no cases. The follow-up design might be more rigorous than the case-control design but is much less practical. Gehlbach12 believes that the case-control design is ideally suited for initial exploratory investigations. Analysis of case-control designs is best done by calculating the odds ratio. Simon28 provides a detailed discussion of this point.

Follow-up (Cohort) Design

A cohort is a group of people who share a common characteristic (eg, year of birth, place of employment, or exposure to a risk factor such as radiation). Sometimes considered the crème de la crème of observational studies,12 the follow-up or cohort design (Fig 18-4) begins with a study population free of the condition or outcome of interest. The population is then measured or classified on the basis of the characteristics of interest. At the start of the study, investigators can gather information that is as complete as they have resources or desire to acquire without the problems of accuracy in data gathering associated with the case-control study. The investigators then observe the subjects repeatedly over time and note when and if the outcome of interest occurs. In fact, they can measure multiple outcomes. Traditionally, cohort design has referred to groups formed on the basis of exposure to some agent, but the same principles are applied when the exposure is a treatment or other intervention. The outcome is correlated with the properties measured or recorded initially, as it may be necessary to adjust for confounders. In other words, the exposed and nonexposed groups might differ in important properties that may be relevant to the outcome. In particular, cohort studies are vulnerable to selection bias. Volunteers for the intervention would be expected to differ from nonvolunteers in characteristics such as compliance with treatment and education. Confounding would be expected in cohort design studies, because there is no randomization process that equalized the groups before treatment or exposure. Moreover, there may


Fig 18-4 Follow-up (cohort) design used to study effects of a particular exposure. Members of a cohort differ in important characteristics (such as age). The data may be stratified into subgroups and analyzed separately or adjusted for confounding by methods such as Mantel-Haenszel.

be some strata or gradient of exposure that might also require adjustment. Normand et al29 provide a guide to analytical methods used to adjust for confounding, and new methods are available that attempt to adjust for unmeasured confounding—an impressive feat of statistical legerdemain.21 Cohort studies come in different flavors. Figure 18-4 illustrates a prospective cohort study, but cohort studies can also be done using data collected in the past (retrospective) or using a combination of data collected in the past as well as over time. The key feature is that the subjects are selected for the groups on the basis of exposure without knowing the outcome at the time. In other words, the subjects are free of the outcome at the time their exposure status is defined. This approach obviously has more power than the cross-sectional design for elucidating cause-effect relationships, because the patients are classified before the outcome is observed. A thin child who became fat could be tested to see which changed first, the child’s level of activity or the child’s weight.12 Follow-up studies are susceptible to selection bias, because various factors tend to select the subjects observed at the end of the study. Most studies will lose some subjects as people die, move, or simply quit the study. Investigators should compare the dropouts or other lost subjects to those who remain to determine whether they differ in such a way that the outcome could be affected.

Another problem is that people can change their habits from the beginning to the end of the study, so that smokers could become nonsmokers and drinkers could become teetotalers. Sometimes subjects may change their habits because they are in the study. Technical problems of this design include the following: (1) it may be difficult to retain control of the therapy; (2) blindness among subjects (see chapter 19) and consensus may be difficult to achieve; (3) it may violate some statistical tests based on the assumption of randomization; (4) for rare disorders, large sample sizes or long follow-up periods are required, and the study can be expensive.7 An example of a highly successful cohort study, the Dunedin study, is discussed in chapter 16.
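The Mantel-Haenszel adjustment named in the Fig 18-4 caption pools the OR across strata of a confounder. A minimal sketch with invented stratified 2 × 2 tables (the use of statsmodels’ StratifiedTable is our assumption about tooling, not a method taken from the text):

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# Invented 2 x 2 tables, one per age stratum, each laid out as
# [[exposed diseased, exposed healthy],
#  [unexposed diseased, unexposed healthy]]
tables = [
    np.array([[20, 80], [10, 90]]),   # younger stratum, OR = 2.25
    np.array([[60, 40], [45, 55]]),   # older stratum, OR = 1.83
]

st = StratifiedTable(tables)
print(f"Mantel-Haenszel pooled OR = {st.oddsratio_pooled:.2f}")
print(st.test_null_odds())            # chi-square test of pooled OR = 1
```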

Scientific Standards in Correlational Experiments Involving Humans

Epidemiologic studies have led to splendid achievements, including demonstrations of a dietary deficiency leading to pellagra, the association between cigarette smoking and cancer, the protective dental effect of fluoridated water, and the role of thalidomide in phocomelia. But epidemiology is also plagued with controversies. In one survey, 56 different postulated cause-effect relationships were found in which epidemiologic studies contradicted one another.30 Perhaps because of this plethora of conflicting evidence, there is now widespread skepticism about the continuing stream of reports implicating such common items of daily life as eggs, coffee, or sugar as menaces to health. Feinstein31 has argued that epidemiologic studies often lack the precautions, calibrations, and relative simplicity that are taken for granted in experimental science. To rectify this problem, he advocates applying to epidemiologic studies five scientific standards commonly assumed in experimental research:

1. High-quality data. Persons should be directly examined with methods that can be carefully calibrated for their reproducibility and validity. Epidemiologic data are often not of high quality. The clinical diagnosis of the outcome disease in the cases is usually accepted as stated, although, on occasion, investigators may check for false-positive errors by reviewing the available diagnostic evidence. However, in the control group, which is chosen because the target disease was not diagnosed, evidence of the disease's absence is almost never verified, and members of the control group do not receive the appropriate diagnostic tests. Another problem is that exposure, caseness, or both depend on recall, which is fallible.32
2. A stipulated research hypothesis. Epidemiologic studies generate vast quantities of data. The advent of high-power computing enables these data to be dredged for numerous statistical associations. However, associations gleaned in this way do not have the acceptability associated with data that conform to a previously stipulated research hypothesis.
3. A well-specified cohort. Each person included in a study should be checked for suitable eligibility for the study, and each person should be accounted for thereafter.
4. Avoidance of detection bias. Disease should be sought with equally intense methods of surveillance and examination in the exposed and nonexposed groups.
5. Analysis of attributable actions. The complexity and erratic nature of human exposure to multiple agents makes it difficult for epidemiologists to define exposure to an agent as precisely as would be desired.

In the future, Feinstein concludes, investigators will have to focus more on the scientific quality of the evidence and less on the statistical methods of analysis and adjustment.31

Some Potentially Serious Problems in Determining Causation from Observational Data

Residual confounding

Statistical modeling can work perfectly only when all the measures in a model are measured perfectly. Residual confounding can occur when a confounder has not been adequately adjusted for in the analysis, as is the case when the measurements of the confounding variable are inaccurate. Residual confounding can be suspected when obviously spurious associations appear. Risk factors tend to cluster; for example, in a study of meat intake and mortality involving more than half a million people,33 those who consumed the most red meat were heavier, smoked more, were less educated, and consumed more calories and fat and fewer fruits and vegetables than those who consumed less red meat. Sophisticated statistical modeling led the researchers to conclude that red meat had an independent effect on mortality. However, a more detailed examination of the model by Sainani34 found that it produced some other statistically significant but implausible associations. For example, death by infectious disease could not plausibly be explained by red meat consumption, and this spurious association indicated that the model contained some unaccounted-for (residual) risk. Such residual confounding leads to spurious association. For example, if the amount of smoking were underestimated, some of the mortality actually attributable to smoking might be captured by some other variable associated with smoking, perhaps red meat consumption, making that variable appear (mistakenly) to be independently associated with mortality. Because the risks for some implausible associations were similar to that found for consumption of red meat, Sainani34 concluded that the study did not provide evidence of a causal relationship between consuming red meat and death.

As noted earlier, a likely cause of spurious association is confounding. Associations reported in observational studies but not confirmed in randomized controlled trials (RCTs) tend to be associations of exposures that are related to many socioeconomic and behavioral measures that are in turn related to disease.35 On the basis of an observational study,36 hormone replacement therapy (HRT) was originally thought to protect against coronary heart disease (CHD) because the risk ratio was calculated at 0.39. Because of that impressive result, an RCT on HRT was undertaken, but the results were disappointing, to say the least; the risk for CHD was found to be substantially and significantly higher for participants taking HRT than for untreated controls.37 It is noteworthy that the original study did not measure a number of important potential confounding variables, such as exercise, use of cholesterol-lowering drugs, and ethnicity, and residual confounding is now thought to be the reason for the aberrant relationship first reported.37 The RCT does not suffer from residual confounding because the randomization process distributes possible confounders equally among the groups. Moreover, in the original study, HRT was found to confer almost as much protection against accidental or violent death as against CHD.38 There does not appear to be a plausible biologic link to explain the association of HRT with accidental death, so it is possible that both associations were confounded with other risk factors. Beral et al38 concluded that "previous claims that HRT protects against CHD should now be discounted." HRT users tended to be healthier and more affluent and to engage in healthier lifestyle habits, so observational studies examining the beneficial health effects of HRT are prone to selection bias. Socioeconomic status, a variable that is important but difficult to measure, influences multiple health outcomes and can be a factor in residual confounding. What may be of particular importance in oral health epidemiology is general health awareness. Hujoel et al39 found a dose-dependent association between two causally unrelated oral (lack of dental flossing) and general lifestyle (obesity) characteristics. They noted that good oral health may be the result of factors related to general health awareness rather than simply oral self-care patterns. They argued for more consideration of the hypothesis that general health awareness factors influence both oral and systemic diseases.
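A small simulation can make residual confounding concrete. The Python sketch below uses invented probabilities, not data from any of the studies discussed above: mortality depends only on smoking, red meat intake is merely correlated with smoking, and recorded smoking status is misclassified 25% of the time.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    smoker = rng.random(n) < 0.3
    red_meat = rng.random(n) < np.where(smoker, 0.7, 0.3)       # risk factors cluster
    dead = rng.random(n) < np.where(smoker, 0.10, 0.02)         # no direct meat effect
    recorded = np.where(rng.random(n) < 0.25, ~smoker, smoker)  # 25% misclassified

    def stratum_or(exposure, outcome, mask):
        # Odds ratio for exposure vs outcome within one stratum
        e, o = exposure[mask], outcome[mask]
        a, b = np.sum(e & o), np.sum(e & ~o)
        c, d = np.sum(~e & o), np.sum(~e & ~o)
        return a * d / (b * c)

    for label, conf in (("true smoking", smoker), ("recorded smoking", recorded)):
        ors = [stratum_or(red_meat, dead, conf == level) for level in (True, False)]
        print(f"ORs within {label} strata: {ors[0]:.2f}, {ors[1]:.2f}")

Within the true smoking strata, the red meat-mortality odds ratios hover near 1.0; within the misclassified strata, they remain well above 1.0. That leftover association, surviving an apparently proper adjustment, is residual confounding.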

Some general rules for interpreting observational studies

Some general rules for interpreting results from observational studies have been developed. First, the size of the effect should be considered. Sainani34 suggested that in observational studies, when the ORs are small to moderate (ORs between 0.6 and 1.6 for a binary predictor), the observed effects could be entirely due to residual confounding when strong confounders are in play. One of the major criticisms of many studies relating periodontal disease to cardiovascular disease has been that they adjusted for smoking (a strong confounder) by the use of multivariate analysis, an approach that is open to bias due to residual confounding. Second, when the OR (or other risk measure) for the relationship being investigated is similar in value to ORs found for obviously implausible relationships in the same study, it becomes possible that the relationship is a case of spurious association (eg, dental flossing and obesity). Third, an obvious limitation of adjusting for confounding is that the confounder needs to have been measured and reported. Retrospective analysis of some data sets may be hampered because variables that are now thought to be important might not have been considered so when the study was done and thus were not measured.
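As a worked example of the first rule, the sketch below computes an odds ratio with its 95% confidence interval from a hypothetical 2 × 2 table (the counts are invented; the log-scale Woolf interval is a standard textbook formula) and flags Sainani's 0.6-to-1.6 zone.

    import math

    def or_with_ci(a, b, c, d, z=1.96):
        # Odds ratio from a 2x2 table with a Woolf (log-scale) 95% CI
        or_hat = a * d / (b * c)
        se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        return or_hat, or_hat * math.exp(-z * se_log), or_hat * math.exp(z * se_log)

    or_hat, lo, hi = or_with_ci(25, 75, 20, 80)   # hypothetical exposure-outcome counts
    print(f"OR = {or_hat:.2f} (95% CI {lo:.2f} to {hi:.2f})")
    if 0.6 < or_hat < 1.6:
        print("Small-to-moderate OR: residual confounding alone might explain it.")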

Dubious assumptions

Stephen Gordon, an economist at Laval University, writing in the National Post (Canada), warns, as many before him have done, that correlation is not causation. Gordon, however, describes the reasons that predictions by economists may not be accurate and why economists often disagree among themselves.40 In brief, economists most often deal with observational data, and to interpret them they have to make specific assumptions about how the world works, that is, specify a model with a theory of causality. In a complex world, this often comes down to imposing plausible restrictions that close off some of the possible causal relations, for example, some of the possible interactions between variables. Critics of a study may challenge these assumptions, but then the burden of proof lies with them to come up with a better set of plausible restrictions. If two models come to differing conclusions using the same data, it is the assumptions rather than the data that are guiding the conclusions, and this raises the possibility that an economist who favors a certain conclusion can devise a model to support it.

In the best seller Freakonomics, the respected economist Steven Levitt and his co-author, New York Times reporter Stephen J. Dubner, asserted that a significant fraction of the crime reduction in the United States in the 1990s was linked to the change in abortion law brought about by the Roe v. Wade decision in 1973.41 The basic causal mechanism proposed was that the increase in abortion disproportionately eliminated potential future criminals. The counterfactual argument that those unwanted babies, had they been born, would have experienced unfortunate circumstances predisposing them to committing crimes is superficially plausible; the absence of these potential criminals 15 to 20 years later would lead to less crime than would have been expected had they been present. However, over the 15 to 20 or so years between the Roe v. Wade decision and the decline in crime, many other changes in social conditions had also taken place, such as the widespread availability of other fertility control technologies, that might plausibly have affected the number of individuals living in the 1990s and their propensity to crime. Further events, summarized by Manzi,42 showed the fragility of the legalized abortion-crime decline conclusion. The academic basis behind the argument was a study by Donahue and Levitt43 that included three regression models. As would not surprise Stephen Gordon, other academics published alternative versions of the same analysis using slightly different assumptions and did not see any effect. Although other arguments were also involved, most of the academic discussion centered on the regression analyses, and Manzi42 states that "it became a standard battle of the dueling regressions." In reply to the criticisms, Levitt and Donahue argued that their preferred assumptions were superior and should be used. Then two critics noted that the implementation of the software used in the study contained an error; once that error was corrected, and some other technical changes were made, the asserted effect of abortion on crime was no longer evident. Levitt and Donahue agreed that an error had been made, but once they fixed it and rejigged their model and data sets, they were able to see the effect again. Therefore, the situation was that the old assumptions were defensible and had in fact been defended by Levitt and Donahue. The new assumptions were also defensible. A cynic might feel that Levitt and Donahue were altering their methods to arrive at their favored conclusion. Other studies sometimes found an effect and sometimes did not. Manzi42 concludes that regression analysis cannot tell us the effects of abortion on crime because different reasonable assumptions for the analysis lead to completely different answers.

In a series of three articles, a columnist for The Economist44 reviewed shortcomings of the discipline of economics using, among other examples, the Donahue and Levitt paper discussed earlier. A pattern was noticed in which striking new findings may be promoted by institutions, like-minded interest groups, and politicians, and the information or misinformation, as the case may be, is further amplified by social media. Moreover, conflicting results and corrections are often ignored. The columnist ended his essay with a cautionary conclusion that applies widely across disciplines: "Being alert to the shortcomings of published research need not lead to nihilism. But it is wise to be skeptical about any single result, a principle this columnist resolves to follow more closely from now on."44

References

1. Larkin PA. Notes for Biology 300 (Biometrics): A Handbook of Elementary Statistical Tests. Vancouver, BC: University of British Columbia, 1978:131.
2. Tufte ER. Beautiful Evidence. Cheshire, CT: Graphics Press, 2006:159.
3. Brunette DM. Causation, association, and oral health–systemic disease connections. In: Glick M (ed). The Oral-Systemic Health Connection. Chicago: Quintessence, 2014:13–48.
4. Keene HJ, Shklair IL, Anderson DM, Mickel GJ. Relationship of Streptococcus mutans biotypes to dental caries prevalence in Saudi Arabian naval men. J Dent Res 1977;56:356–361.
5. Abramson JH. Making Sense of Data. Oxford: Oxford University, 1988.
6. Gilbert N. Biometrical Interpretation. Oxford: Clarendon, 1973.
7. Streiner DL, Norman GR, Blum HM. PDQ Epidemiology. Toronto: Decker, 1989.
8. Willemsen EW. Understanding Statistical Reasoning. San Francisco: Freeman, 1974.
9. Norman GR, Streiner DL. PDQ Statistics. Toronto: Decker, 1986:96.
10. Huff D. How to Lie with Statistics. New York: Norton, 1954.
11. Hume D. A Treatise of Human Nature (1739–1740), ed 2. Oxford: Clarendon, 1978:173–175.
12. Gehlbach SH. Interpreting the Medical Literature. Lexington: Collamore, 1982:39–71.
13. Davies HTO, Crombie IK, Tavakoli M. When can odds ratios mislead? BMJ 1998;316:989–991.
14. Yip JK, Cho SC, Francisco H, Tarnow DP. Association between oral bisphosphonate use and dental implant failure among middle-aged women. J Clin Periodontol 2012;39:408–414.
15. Colton T. Statistics in Medicine. Boston: Little Brown, 1974:211.
16. Smith GD, Phillips AN. Confounding in epidemiological studies: Why "independent" effects may not be all they seem. BMJ 1992;305:757–759.
17. Oğuz F, Eltas A, Beytur A, Akdemir E, Uslu MÖ, Güneş A. Is there a relationship between chronic periodontitis and erectile dysfunction? J Sex Med 2013;10:838–843.
18. Gilbert N. Biometrical Interpretation. Oxford: Clarendon, 1973:27–45.


19. Küchenhoff H. Misclassification and measurement error in oral health. In: Lesaffre E, Feine J, Leroux B, Declerck D (eds). Statistical and Methodological Aspects of Oral Health Research. Chichester: Wiley, 2009:279–294.
20. Janket SJ, Baird AE, Chuang SK, Jones JA. Meta-analysis of periodontal disease and risk of coronary heart disease and stroke. Oral Surg Oral Med Oral Pathol Oral Radiol Endod 2003;95:559–569.
21. Dietrich T, Garcia RI. Associations between periodontal disease and systemic disease: Evaluating the strength of the evidence. J Periodontol 2005;76:2175–2184.
22. Sackett DL. Bias in analytic research. J Chronic Dis 1979;32:51–63.
23. Norman GR, Streiner DL. PDQ Statistics, ed 3. Hamilton: Decker, 2003:61.
24. Streiner DL, Norman GR, Blum HM. PDQ Epidemiology. Toronto: Decker, 1989:48.
25. Rosen M, Nystrom L, Wall S. Guidelines for regional mortality analysis: An epidemiological approach to health planning. Int J Epidemiol 1985;14:293–299.
26. Elwood JM. Critical Appraisal of Epidemiological Studies and Clinical Trials. Oxford: Oxford University, 1998:29–32.
27. Doll R, Hill AB. Smoking and carcinoma of the lung: A preliminary report. BMJ 1950;2:739–748.
28. Simon S. Odds ratio versus relative risk. http://www.pmean.com/01/oddsratio.html. Accessed 11 July 2019.
29. Normand SL, Sykora K, Li P, Mamdani M, Rochon PA, Anderson GM. Readers guide to critical appraisal of cohort studies: 3. Analytical strategies to reduce confounding. BMJ 2005;330:1021–1023.
30. Mayes LC, Horwitz RI, Feinstein AR. A collection of 56 topics with contradictory results in case-control research. Int J Epidemiol 1988;17:680.
31. Feinstein AR. Scientific standards in epidemiologic studies of the menace of daily life. Science 1988;242:1257.
32. Sackett DL. Evaluation: Requirements for a clinical application. In: Warren KS (ed). Coping with the Biomedical Literature. New York: Praeger, 1981:123.

33. Sinha R, Cross AJ, Graubard BI, Leitzmann MF, Schatzkin A. Meat intake and mortality: A prospective study of over half a million people. Arch Intern Med 2009;169:562–571.
34. Sainani K. The limitations of statistical adjustment. PM R 2011;3:868–872.
35. Smith GD, Ebrahim S. Data dredging, bias, or confounding. BMJ 2002;325:21–28.
36. Grodstein F, Stampfer MJ, Manson JE, et al. Postmenopausal estrogen and progestin use and the risk of cardiovascular disease. N Engl J Med 1996;335:453–461.
37. Rossouw JE, Anderson GL, Prentice RL, et al. Risks and benefits of estrogen plus progestin in healthy postmenopausal women: Principal results from the Women's Health Initiative randomized controlled trial. JAMA 2002;288:321–333.
38. Beral V, Banks E, Reeves G. Evidence from randomised trials on the long-term effects of hormone replacement therapy. Lancet 2002;360:942–944.
39. Hujoel PP, Cunha-Cruz J, Kressin NR. Spurious associations in oral epidemiological research: The case of dental flossing and obesity. J Clin Periodontol 2006;33:520–523.
40. Gordon S. Correlation is not causation is not a magic phrase that wins every argument. Consider this a warning. https://nationalpost.com/opinion/stephen-gordon-correlation-is-not-causation-is-not-a-magic-phrase-that-wins-every-argument. Accessed 8 January 2018.
41. Levitt SD, Dubner SJ. Freakonomics: A Rogue Economist Explores the Hidden Side of Everything. New York: HarperCollins, 2005.
42. Manzi J. Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics and Society. New York: Basic Books, 2012:110–117.
43. Donahue J III, Levitt SD. The impact of legalized abortion on crime. Q J Econ 2001;116:379–420.
44. Many results in microeconomics are shaky. The Economist. 26 April 2018. https://www.economist.com/finance-and-economics/2018/04/26/many-results-in-microeconomics-are-shaky.


19

Experimentation

"Argument is conclusive . . . but it . . . does not remove doubt, so that the mind may rest in the sure knowledge of the truth, unless it finds it by the method of experiment."
ROGER BACON1

"Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don't know we don't know. And if one looks throughout the history of our country and other free countries, it is the latter category that tends to be the difficult ones."
DONALD RUMSFELD2

The Persian poet Ibn Yamin said there are four types of men3:

One who knows and knows that he knows . . .
His horse of wisdom will reach the skies.
One who knows, but doesn't know that he knows . . .
He is fast asleep, so you should wake him up!
One who doesn't know, but knows that he doesn't know . . .
His limping mule will eventually get him home.
One who doesn't know and doesn't know that he doesn't know . . .
He will be eternally lost in his hopeless oblivion!

Independent and Dependent Variables

A shorthand model for a simple experiment follows:

          Tx   O
S    R
          C    O

where S = selection of eligible subjects; R = randomization, an essential difference between experimental and quasi-experimental designs; Tx = treatment or manipulation of the independent variable; C = control; and O = observation of the dependent (or response) variable. Ideally, an experiment is designed so that all properties, apart from those under investigation, are held constant. A property fixed in this way is commonly called a parameter. In Great Scientific Experiments, Harré4 notes that many classic experiments depended on the skill

of the experimenters in fixing parameters. For example, caries incidence may be related to host factors, dietary factors, and microbiologic flora, and, to study any one of these factors, an investigator must consider the effect of the other variables.

The variables under study are the dependent and independent variables. The factor, treatment, or variable manipulated by the experimenter is called the independent variable. The particular attribute measured in response to changes in the independent variable is called the dependent variable. Consider the experiment5 in Fig 19-1, in which the investigator applied a known load to a periodontal ligament and measured the amount it stretched (ie, the extension). The applied load is controlled by the experimenter and is therefore the independent variable. The extension of the ligament is the response to the load and is therefore the dependent variable. This figure is unusual, because the independent variable is normally plotted on the abscissa (x-axis) with the dependent variable plotted on the ordinate (y-axis). The dependent variable in one investigation may be the independent variable in another. Hence, in this example of periodontal ligament extension, it is possible that another investigator would stretch the ligament various distances and record some response (eg, the orientation of cells in the collagen fibers). The independent variable may be only one among many conditions that can affect the dependent variable.

Fig 19-1 A known load (N) was applied to periodontal ligament, and the extension (in millimeters) was measured. (Reprinted with permission from Atkinson and Ralph.5)

Controlled versus uncontrolled or uncontrollable variables

In designing experiments, an investigator is often faced with an array of variables, some of which can be controlled and others of which cannot. The variable selected as an independent variable must be controllable. In a study on the effectiveness of dentifrices, it is fairly easy to control the type of toothpaste used by the subjects but difficult to control their diets. The traditional golden rule applied in such situations appears to be to control what can be controlled and to measure (or observe) the rest. The measurements of the uncontrolled variables might prove useful in a retrospective analysis and help explain unexpected results.

Uncontrolled

In his book Uncontrolled,6 Manzi embraces randomization as the solution to the problem of control in causally dense situations, such as many business problems. Traditionally such problems might be approached by regression analysis of historical data, but three problems present enormous difficulties: omitted variable bias (as Rumsfeld noted, you don't know what you don't know, and omitting an important variable because you do not know about it compromises the model), high-order interaction effects, and variable intercorrelation. Similarly, in biology and medicine, investigators cannot know all the causes of a state of health in an individual and so cannot hold all of them constant. A number of individuals have employed experimentation, including the prophet Daniel (thoughtfully discussed by Neuhauser and Diaz7). The development of randomization to assign subjects to experimentally treated or control groups, as well as of other methods to ensure comparisons of like with like, involved numerous investigators,6–8 including medical scientists (eg, Hill, Bell), philosophers (eg, Peirce), and statisticians (eg, Fisher). Randomization ensures that even though the investigator may not know all the factors influencing the response to a treatment, both the known and unknown factors will be equitably distributed between the experimental and control groups, so that any effect noted after treatment is the result of the treatment and not of some preexisting difference between the groups. It can be appreciated that this makes the experiment easier to run than attempting the venerable method of Lind in his experiments on scurvy, namely, to control many variables by matching them between the groups ("their cases were as similar as I could make them"). Manzi6 states that in the last 60 years, the randomized field (or controlled) trial has driven out alternative means of determining therapeutic efficacy wherever it can be applied practically and has become the gold standard for evidence of a causal relationship. Although randomization is conceptually simple, it is not necessarily easy to implement, and methods to create unbiased comparison groups and avoid selection biases have evolved over time.8

The simplification afforded by randomization (as compared with control of variables by matching and the like) enables the more widespread use of trial-and-error approaches to problem solving. Consider the Development Office at the University of British Columbia (UBC) attempting to decide whether to use blue envelopes or gold envelopes (UBC's colors are blue and gold) for the solicitation letters sent to alumni for donations. Traditionally such a decision might be made by discussion within a committee. The Blues might point to the beneficial calming effect of blue on alumni, who might have been irritated by too many requests. The Golds would emphasize that gold signals the quality of a UBC degree. Other committee members might request a study of the historical response rates to appeals enclosed in a variety of colored envelopes as well as a control white version. Still another member might suggest that, of course, in such a study the dollar value of donations should be adjusted for inflation. Still another might point out that the donations credited to female graduates should be adjusted for the male/female wage gap. After this erudite academic discussion, there might or might not be a unanimous decision, but at the end of the day, when one color was chosen, there would be no real knowledge of whether it was the best choice. So to save itself trouble and get a definitive result, the committee might decide simply to do a trial-and-error experiment: randomly select alumni to receive blue or gold envelopes and see which group donated more money. There would be no evidence for a particular causal pathway related to the choice, but the Development Office would know which envelope worked best.

Manzi6 points out that this sort of approach is widely used in business today. He reports that the credit card company Capital One reportedly runs over 60,000 tests per year, whereas Google ran some 12,000 randomized experiments in 2009, with 10% of them resulting in business changes. Running such tests is simple, cheap, and scalable to millions of responders and can discover predictive rules that are sufficiently nuanced to be of practical use in the very complex environment of real-world human decision making.9 The emphasis is placed on executing many fast, cheap tests in rapid succession, rather unlike the medical model of randomized controlled trials involving multiple centers with rigid protocols and extensive administration. It is an evolving strategy, and like evolution itself, it is based on trial and error (through the genetic mechanisms of mutation and recombination by chromosome crossover) and definitive assessment (as in survival of the fittest). Trial and error can be a very effective and accessible approach to exploring problems and is used widely in everyday life. For example, people will often attempt to solve a problem with a sluggish computer by shutting it down and then restarting it, perhaps while muttering the word "reboot" to impress nearby listeners with their technical sophistication. Indeed, the approach is even accessible to my Airedale Terriers. The purpose of an Airedale's life is to have fun, so typically an Airedale such as my dog Pepper and his predecessors will bring a ball to me and toss it at my feet to invite me to play. But if I ignore the ball invitation, he will return with another toy, such as a squeaky toy, to see if that catches my interest. If that fails, he will bring another type of toy, such as a tug toy. Eventually my resistance wears down, so Pepper finds trial and error to be an effective strategy. But trial and error is rarely seen in dental research, so our emphasis here is on the approaches to experimentation seen in articles published in dental journals.
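The envelope question lends itself to a short simulation. The sketch below uses invented donation amounts (nothing here reflects actual UBC data): alumni are randomized to blue or gold, and a permutation test asks whether the observed difference in mean donations could plausibly have arisen by chance.

    import numpy as np

    rng = np.random.default_rng(42)
    n = 5000                                     # alumni per arm
    blue = rng.exponential(scale=20.0, size=n)   # invented donations ($)
    gold = rng.exponential(scale=21.0, size=n)   # invented, slightly more generous

    observed = gold.mean() - blue.mean()
    pooled = np.concatenate([blue, gold])
    diffs = []
    for _ in range(5000):                        # re-randomize the color labels
        rng.shuffle(pooled)
        diffs.append(pooled[n:].mean() - pooled[:n].mean())
    p = np.mean(np.abs(diffs) >= abs(observed))
    print(f"Observed difference: ${observed:.2f} per letter (permutation p = {p:.3f})")

As in the Development Office example, the test says nothing about why one color works better; it simply identifies the better-performing envelope under randomization.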

Requirements for a Good Experiment

An experimenter must make decisions regarding factors such as definitions, sampling, experiment design, measurement, statistics, and generalization.10 In a good experiment, these decisions are made satisfactorily. Some characteristics of a good experiment in biologic and clinical research follow.11


Adequate controls

The value of controls is that they eliminate plausible alternative hypotheses. A good experiment has adequate controls.

Difficulties

The most common method of induction is Mill's method of difference, which requires the experimental group to differ from the control group with respect to only one variable. However, a one-variable difference between groups is difficult to achieve. Consider the apparently simple problem, outlined by Heath,12 of studying the effect of different concentrations of potassium ion (K+) on cell cultures. For such an experiment, it is unfortunate that K+ is available only in combination with an anion such as chloride (Cl–). Thus, to change the K+ concentration, an experimenter would also have to change the concentration of the anion. Second, in varying the K+ concentration, the experimenter also varies the ratio of K+ to all the other cations and anions in the culture medium. The effects of changes in these other variables are said to be confounded in this experiment; in other words, if an effect were found, it would be impossible to tell which of the changed variables (K+, anion, ratios) caused the effect. Heath12 stated that "a treatment applied in an experiment is never simple in the sense that it alters only one factor," and, moreover:

The limit to the number of such confounded effects for any experimental treatment is set only by our knowledge and powers of imagination. An experiment in which the application of treatments to the material under investigation has been properly randomized can yield an unbiased estimate of the effects of those treatments as applied; it gives no information as to which of the myriad components in any treatment comparison are responsible for the effects observed. That is a question of interpretation and is entirely a matter for the experimenter's judgment.

The reader of a scientific paper may not share the author's interpretation and may propose that the effect was really produced by one of the confounded variables. Investigators have some options when confronted with confounded variables or concomitant effects. The first, and weaker, alternative is to quote the literature; in this example, investigators would need to reference papers stating that changes in Cl– concentration had no effect on culture growth. However, relying on other work may be dubious because there are inevitable differences between different experimental systems. These differences weaken the form of inductive argument used by the investigator in this situation (ie, argument by analogy). The better option is to design an experiment so that the particular variable causing concern is no longer confounded.

A significant difference between experiments in physics and chemistry and those in biology and medicine is that the variables that must be controlled in the physical sciences are often known and can be controlled. In studying chemical reaction rates, we know that we must control the temperature, but we do not have to worry about the phase of the moon. In contrast, in psychiatry, there are a vast number of factors that might influence behavior, including a subject's upbringing, genetic composition, biochemistry, and even, if apocryphal tales of correlation between criminal activity and a full moon are true, the phase of the moon. Moreover, in physics and chemistry, there are often laws that describe the relationship between variables so that an investigator can make valid comparisons between observations that were made under different conditions. In biology and medicine, no such precise laws are available. Thus, in physics and chemistry, although the equipment required must be precise and sophisticated, experiment designs are often, but not always, simple in comparison with the elaborate designs used in psychology and medicine.

Technical or conceptual sophistication does not cancel the need for controls. The challenge becomes devising appropriate controls when new investigational tools are developed. For example, antisense oligodeoxynucleotides (ODNs) can be used for the specific manipulation of gene expression. In theory, inhibition of gene expression is brought about when ODNs bind a complementary messenger RNA (mRNA) sequence and prevent translation of the mRNA. However, some early studies employing antisense ODNs gave erroneous results because the investigators did not realize that there could be nonspecific effects of ODNs and that ODNs did not necessarily enter the cell freely. The outcome was that additional controls were required to measure antisense effects and to differentiate them from nonspecific effects. Such controls included direct measurement of target RNA, careful choice of control sequences, and demonstration of cell permeability to the ODNs.13 In summary, there was a development of knowledge about antisense ODN experiments that necessitated additional controls. These additional controls were recognized as being necessary only after the limitations and problems of the antisense ODN experiments were realized. This example shows that choosing appropriate controls requires a detailed and comprehensive knowledge of the biologic system under investigation.

It is difficult to have a well-controlled experiment in biology or medicine. It is a creative act to determine which of the many possible differences between the untreated control group and the treated (best thought of as the "treatment as applied") groups could be important and to devise appropriate control groups that eliminate plausible alternative hypotheses. Thus, it should be no surprise that many experiments published in the literature do not approach this standard.

Use and abuse of controls in the medical and dental literature

Studies on the use and abuse of controls have long appeared in the medical literature.14 From such reviews, the following is clear:

1. Controls are often absent in reports of clinical trials in the medical literature. It seems that even the elementary principles of experimentation and the thoughts of John Stuart Mill are not adequately appreciated by biomedical investigators.
2. Experts vary in their estimates of the percentage of reports judged to be adequately controlled. There could be many reasons for this variation, including the journal and time period studied, as well as the reviewers' standards. For example, Patterson15 asserts that therapeutic regimens used in diseases whose known courses are quite constant require less rigid controls for evaluation, while Mahon and Daniel16 believe that "probably the only clinical situation where no controls are necessary is the treatment of disease which is universally and rapidly fatal." Exactly what constitutes an adequate control is not a simple matter. A true control is one in which all the relevant variables (save the putative one being tested) are identical between the experimental and control groups. The number of relevant variables known depends on the state of knowledge of the particular area of science at a particular time. There is little that investigators can do about this problem except randomize their groups. In other cases, a relevant variable may be identified or known to be operative, and the problem becomes how investigators perform controlled experiments. Experiment design will be discussed later (see chapter 20).

3. Real controls are useful. In a review comparing the results of controlled versus uncontrolled studies, Spilker17 found that the success rate (ie, the proportion of papers that claimed that their therapy was useful) was much higher in uncontrolled than in controlled studies. In other words, authors whose studies lacked controls could not detect that other factors were at work that would explain their results. Spilker notes that his review supports “Muench’s second law: Results can always be improved by omitting controls.”17

Negative, positive, and active controls

In biologic experimentation, investigators examine the effects of biologic response modifiers (BRMs), which may be drugs, growth factors, changes in environmental conditions, genotype, or anything that might influence the components of a biologic system. However, biologic assay systems can exhibit day-to-day or batch-to-batch variation. In cell culture, experiments might not work on some days because some sort of contaminant, such as endotoxin, killed the cells or rendered them otherwise unresponsive. Such potential problems can be dealt with by incorporating positive and negative controls.

Negative controls

A negative control is synonymous with most people's idea of a control and is easily understood. With the negative control, the biologic response is observed in an untreated group. The purpose of this control is to set up a direct comparison so that, as John Stuart Mill would have liked, the experimental and control groups differ in only one important factor: the treatment.

Positive controls

Equally important but less practiced is the positive control, which shows that the observed system is capable of response. In cell culture, a positive control for an experiment investigating a suspected growth factor would be to include a group in which the cells were exposed to a known growth factor. If the suspected growth factor did not produce any growth, but neither did the positive control, a technical problem would be suspected, and the effect of the suspected growth factor would be tested in another experiment. Without the positive control, the suspected growth factor might not be investigated further. Another value of positive controls is that they can identify an assay system in which no increased activity can be observed. The latter condition is sometimes called the ceiling effect (ie, when you are at the ceiling, there is no way to go higher). The frugal scientist might think that the effort placed in performing positive controls is wasted if the experimental treatment also produces a positive result. But such is not the case; even in the successful experiment, the positive control will enable the investigator to assess the potency of the BRM relative to that of the positive BRM in the control. Taken over time, positive controls also provide an estimate of assay variation by retrospective analysis. In short, positive controls are extremely useful and should be incorporated into more experiments.

Active controls

Sometimes it is not possible to perform a negative control in clinical studies. Suppose an investigator believes that a current therapy, while not optimal, does have some beneficial effect. Because there is a need to produce a better treatment, experimentation is required. Ethics dictate that the investigator cannot withhold beneficial treatment from any patient, so there cannot be a group of untreated or placebo-treated controls. In such situations, it may be permissible to treat one group with the experimental BRM and the comparison group with the current best treatment. Those receiving the current best treatment are called the active controls; these subjects are really a type of positive control. The desired result is for the experimental BRM to outperform the active control. The other interpretable result occurs if the experimental BRM is less effective than the standard treatment. The problem arises when there is no difference between the active control and the experimental BRM, because then it is not clear whether either treatment was effective. In such a case, the standard problems in interpreting negative results occur because there are many possible explanations for no effect. It is possible that the detection system was not sensitive enough, that the treatment was inadequate, that compliance with the treatment was inadequate, or that there was heterogeneity in those receiving treatment. If the population is heterogeneous with regard to response, the results will depend on the proportion of responders and nonresponders in the groups.

Adequate measurement and statistics

Absence of systematic error

For a clinical experiment, the absence of systematic error means that if the experiment were performed with a large number of subjects, it would give an accurate estimate of the treatment effects. In practice, this often means eliminating bias in the selection of the sample. In comparing Americans' heights with Canadians' heights, it clearly would be invalid to measure the heights of American men and compare them with the heights of Canadian women; the systematic difference (in this case, sex) between the groups invalidates the comparison. In the parlance of experiment design, sex and nationality are confounded. If a significant difference occurred, the investigator would not know whether to attribute the difference to sex or nationality. The second component of an experiment that must be free from systematic error is the scoring or measurement procedure. In the case of measurement, the instruments or techniques used must be accurate; that is, they must give a true indication of the value being measured. A third possible source of systematic error could occur if one group is examined more (or less) thoroughly than the other, as may happen when one group is hospitalized and the other is not.

Sufficient precision

If there is no systematic error, the estimate of the effect of a treatment or variable will differ from the true value only by random error. A major challenge in designing experiments is to ensure that the random errors are not so large that they obscure the effects of the independent variable of interest. A number of techniques are used to increase the precision of an experiment, such as refining methods or increasing the number of subjects.
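The gain from increasing the number of subjects can be made concrete: with no systematic error, the standard error of a mean falls as 1 over the square root of n. A minimal sketch, assuming an invented measurement standard deviation of 0.5:

    import math

    sd = 0.5                                  # hypothetical measurement SD
    for n in (5, 20, 80, 320):
        print(f"n = {n:>3}: SEM = {sd / math.sqrt(n):.3f}")

Quadrupling the sample size only halves the random error, which is why gains in precision become progressively more expensive.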

Calculation of uncertainty

Every measurement is subject to variation caused by factors that cannot be controlled by the experimenter; this variation is called random error. Sensitive instruments make random error visible. The last digit on a digital pH meter will often change in the course of taking a measurement; one moment it will read 7.40; the next instant, 7.39; then, 7.41; and so on. Because random error is known to exist in any experiment, it should be estimated. Otherwise, a critic could attribute the results to chance. To overcome such a criticism, investigators must have some estimate of the uncertainty in their results. When the random error is known, the investigators can assess the statistical significance of their data; that is, they can assess the probability that any difference between a treated and a control group was not just the result of chance.
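For the pH-meter example above, the random error can be summarized by a confidence interval around the mean of repeated readings. A minimal sketch with invented readings; the t distribution supplies the critical value for the small sample:

    import numpy as np
    from scipy import stats

    readings = np.array([7.40, 7.39, 7.41, 7.40, 7.38, 7.41, 7.40])  # invented
    mean = readings.mean()
    sem = readings.std(ddof=1) / np.sqrt(len(readings))  # standard error of the mean
    t_crit = stats.t.ppf(0.975, df=len(readings) - 1)    # two-sided 95%
    print(f"pH = {mean:.3f} +/- {t_crit * sem:.3f} (95% CI)")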

A wide range of validity

As a general rule, the primary goal of experiments is to increase our understanding of nature and to improve our ability to control and predict events. But our ability to apply the knowledge gained from experiments to other situations is limited, because practical considerations dictate that only a small range of conditions can be examined. The more widely the experiment encompasses relevant variables, the greater the confidence we can have in extrapolating the conclusions. Cox11 believes that a wide range of validity is particularly important in experiments performed to decide some practical course of action and less important where the object is purely to gain insight into some phenomenon. Case reports lack a wide range of validity. As a general rule, they are based on examination of a few individuals (perhaps even a single person) exhibiting unusual symptoms. The conclusions of case reports, if any, tend to have very limited applicability. Case reports are seldom cited. Thus, if we accept the argument that citation frequency is a good indicator of the importance of a finding, the low citation frequency of case reports indicates that a wide range of validity is a condition for an impactful result.

Simplicity

Simple experiments are preferred to complicated ones because they are more likely to be executed correctly. Moreover, simple designs yield data that are easier to analyze. Nevertheless, complicated factorial designs can be more efficient when a wide range of conditions must be investigated. Hence, the investigator has to weigh the relative merits of simplicity, technical feasibility, cost, and range of validity.

Originality

Originality would appear to be a self-evident criterion, for the goal of all research is to make discoveries. Rewards in science are overwhelmingly allotted to scientists who demonstrate originality—that is, those who first arrive at a particular result.18 At the simplest level, originality means that nobody has performed exactly the same experiment before. However, an investigator who only did experiments that met that simple requirement would not normally be classified as possessing an original mind. Original investigators have the ability to ask questions and perform experiments that add to our understanding. Such truly original work interests other investigators, who will subsequently cite it because their own experiments will be based on it in some way. Unfortunately, as Medawar19 states, there is "no such thing as a calculus of discovery." Indeed, creativity has been analyzed from the perspectives of historians, biologists, psychologists, artists, and poets, but there is a continuing question of whether creativity can be learned. Root-Bernstein20 has produced a book that exemplifies scientific discovery even as it describes it. However, creativity can be abetted, Medawar21 believes, "by reading and discussion and by acquiring the habit of reflection guided by the familiar principle that we are not likely to find answers to questions not yet formulated in the mind."

Types of Research

Beveridge22 has discussed and classified types of experimentation in his book The Art of Scientific Investigation:

1. Pure or basic research is done to gain knowledge for its own sake.
2. Applied research is a deliberate investigation of a problem of practical importance. Lewis Thomas is said to have commented that the difference between basic and applied science is the element of surprise that can occur in basic science.23
3. Exploratory research opens up new territory; however, some kinds of exploratory research are quasi-experimental. Anatomists are not experimenting in the strictest sense of the term when they dissect animals, but they still gather information by a manipulative act that would not normally take place in nature.
4. Developmental research consolidates advances (also known as pot-boiling).
5. The transfer method of research applies an ordinary fact, principle, or technique from one branch of science to another, in which it may become novel. In Beveridge's opinion, this is the most fruitful and easiest method in research. From my experience on various granting committees, it seems also to be a very popular approach.

According to Harré,4 the most common type of experiment is probably the measurement of some variable property under variable conditions. Robert Boyle's experiments on the relationship between pressure, volume, and temperature exemplify this kind of study to perfection.18 They also illustrate (for Boyle had no effective way of controlling temperature) that in the real world, few processes are so simple that they comprise only a cause and effect, with no other variables entering the relationship. The next most common experiments are those that attempt to link the structure of something found in an exploratory study to the processes going on in that structure.4 Less common, but perhaps most important, are experiments that attempt to test a theory by proving the existence of something not previously identified in the real world. The search for and isolation of cytokines and growth factors whose existence had been postulated but not verified exemplify this type of study.

Both Beveridge22 and Conant24 have likened research to warfare against the unknown and have outlined some useful tactics. On the psychology of discovery in biologic science, Beveridge advises use of the following tactics:

1. The investigator should follow up clues until the trail is exhausted; workers who change their problems repeatedly are usually ineffectual.
2. From time to time, the investigator should deliberately put the problem out of his or her mind for a while to get a fresh approach, a process Beveridge refers to as the principle of temporary abandonment.
3. The investigator should look for analogies between the problem presented and others that have been solved.
4. In publishing scientific papers, the investigator should present the results accurately but should suggest an interpretation only cautiously, to distinguish clearly between facts and interpretation.
5. Discovery and proof are distinct processes. Discoveries are made by giving attention to small clues or discrepancies in data. Initially, a hypothesis formed on such a basis might not stand up to intense critical scrutiny. Such scrutiny should be reserved for later, when the hypothesis is being tested.

Beveridge22 recommends the following practical aspects of experimentation:

1. Test the whole before the parts. Before you can study anything, there must be some effect to study. Thus, it is often worthwhile to start with a complex system or extreme conditions of some kind, such as a massive dose of the compound of interest, to establish the effect.
2. After establishing the effect, the most important factors can be determined by systematic elimination.
3. Pilot or modest preliminary experiments are often useful in determining the types of problems that might be encountered and in identifying the approximate range of doses or other conditions that can be investigated in more detail later.
4. The investigator should understand the limitations and degree of accuracy of each technical method to obtain reliable results and to interpret them properly. The most common cause of error, according to Beveridge (and my own experience supports his), is a mistake in technique.

Tactics of Experimentation

Experiment tactics can be thought of as the art of making experiments work. The general approach is to devise conditions so that the phenomenon of interest can be seen. The list of tactics used by investigators would be almost endless, for any experiment involves tactical considerations. Three examples are presented here.

Assays and optimization

Many of the most-cited papers in biologic research deal with the development of methods to measure some biologic response. Indeed, much of the art of biologic and medical research consists of devising methods to quantify phenomena and to devise conditions where responses to various agents can be distinguished. Choosing the right conditions requires that decisions be made on how an experiment is to be done, a process that can be called experiment tactics. As an example of experiment tactics, we discuss assays here rather than in chapter 12 on measurement. Two general cases will be considered: optimal and suboptimal conditions.


Fig 19-2 Relationship between velocity of an enzyme-catalyzed reaction and substrate concentration. Curves are shown for optimal and suboptimal pH.

To examine and better understand a complex process, it is often useful to isolate component phases and to consider each separately. Often, when biologic responses are depicted graphically, they take on a pattern where there is a linear (or nearly linear) portion of a curve and a plateau region. Consider the relationship between the velocity of an enzyme-catalyzed reaction and substrate concentration (Fig 19-2). As the independent variable substrate concentration increases, the dependent variable reaction velocity increases almost proportionately. This is the linear (or, in this case, nearly linear) portion of the curve. At some point, there is a transitional zone, after which the reaction velocity remains constant despite the increase of the independent variable; this is the plateau portion of the curve. Depending on the nature of the problem, either portion of the curve may prove useful. If a biochemist were studying the effect of inhibitors on the reaction, it would be best to use the near linear portion of the curve, because it is the most sensitive portion for determining changes. However, if the goal of the experiment is to detect activity, it would be best to choose a substrate concentration on the plateau portion of the curve, as this is the area where optimal activity is found. Disturbances of the substrate (eg, by competitive inhibitors) would then be minimal. Note that after determining the optimal substrate concentration, a biochemist might determine the optimum pH, ionic strength, or another factor. Thus, biochemists can control experimental variables reasonably well by breaking a complex problem (eg, enzyme activity) into parts (eg, substrate concentration, temperature, pH).
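The contrast between the rising portion and the plateau can be shown numerically. The sketch below assumes simple Michaelis-Menten kinetics, v = Vmax × S/(Km + S), with invented constants, and reports how strongly the velocity responds to a 10% change in substrate in each region of the curve.

    Vmax, Km = 4.0, 1.0                       # hypothetical kinetic constants

    def v(S):
        # Michaelis-Menten velocity at substrate concentration S
        return Vmax * S / (Km + S)

    for S in (0.2, 1.0, 10.0):                # rising region, transition, plateau
        rel_change = (v(1.1 * S) - v(S)) / v(S) * 100
        print(f"S = {S:>4}: v = {v(S):.2f}; +10% substrate changes v by {rel_change:.1f}%")

On the near-linear portion, the assay is sensitive to perturbations such as inhibitors; on the plateau, it reports maximal activity and is buffered against substrate disturbances, as described above.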

This strategy works only after conditions have been established that allow the biochemist to see some activity. To arrive at this preliminary stage, tactics vary. Some investigators prefer the synthetic approach (adding factors one at a time until they see some response), while others prefer a shotgun approach (throwing everything conceivable into the mix or taking many precautions and, if some response is seen, simplifying later). For example, in developing his method for localized delivery of drugs, Max Goodson explained to me that he placed a periodontal pack on his patients, even though he did not think it was absolutely required. He simply wanted to take all possible steps to ensure that the released drug remained localized. This cautious type of approach is not unusual but can result in needlessly complex procedures if the subsequent step of simplification is not carried out.

Turning a problem into an asset Another tactic—what might be called the jujitsu method—is to turn a problem into an asset. A common approach is the constructive use of variation. Variation, because it makes differences harder to detect and demonstrate statistically, is often viewed by investigators as a nuisance or problem to be overcome. Some experiment procedures, such as blocking (see chapter 20), are employed explicitly to reduce variation so that any experimental effects are more apparent. However, in some instances, variation can be employed productively, for it carries information about the experiment. In 1943, Lüria and Delbruck25 published a paper destined to become a classic of bacterial genetics. It had been discovered that resistant bacterial populations would emerge when exposed to a selective agent such as viruses (phages). The mechanism was not known. Do a few cells of the original population acquire resistance as a result of exposure to bacteriophage, or does the bacteriophage select for preexisting resistant cells? Lüria and Delbruck devised an experimental test to answer the question (Fig 19-3). A broth culture of Escherichia coli was split into two; one half was cultured in a single culture (the mainline), whereas the other was further divided into multiple cultures (sublines). After growth, the cultures were tested for resistance to bacteriophage in replicate cultures. The replicate cultures from the mainline culture showed similar numbers of resistant colonies, while the sublines showed wide variation. This result could be explained by mutations conferring resistance that was random


Fig 19-3 Culture of Escherichia coli split into a single 10-mL mass culture (mainline) and 50 cultures of 0.2 mL (sublines) to investigate the mechanism of bacteriophage resistance. Each subline was tested individually on phage agar, and 50 replicate cultures of the mainline were tested on phage agar. The variance of the sublines exceeded the variance of the replicates of the mainline.

If the mutation arose early in the subline culture, the subline would yield a large number of resistant cells, whereas if it arose later, only a few cells in the subline would be resistant. In contrast, the mainline culture would include some cells that mutated early and some that mutated late, giving rise to an intermediate number of resistant cells relative to the extremes found in the sublines. Note that in this assay Luria and Delbrück were not interested in the absolute number of bacteria resistant to the bacteriophage, but rather in the variation between cultures. This experiment relied on recognizing the role of chance and using it constructively. In his autobiography,26 Luria, a Nobel laureate, recalled that he conceived the idea while watching a colleague play a slot machine at a faculty dance. The colleague hit a jackpot. The jackpot was analogous to the cultures that produce high numbers of resistant cells, and it dawned on Luria that “slot machines and bacterial mutations have something to teach each other.”
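The logic of the fluctuation test can be mimicked in a small simulation. The sketch below is illustrative only: the mutation rate, number of generations, and culture counts are invented values, not those of the 1943 experiment, and the mainline replicates are modeled simply as Poisson samples from a well-mixed pooled culture.

import numpy as np

rng = np.random.default_rng(1943)

def resistant_count(generations=20, mu=2e-7):
    """Grow one culture from a single sensitive cell; mutants arise at
    random during growth, and their descendants remain resistant."""
    sensitive, resistant = 1, 0
    for _ in range(generations):
        new_mutants = rng.binomial(sensitive, mu)  # chance mutation events
        sensitive = 2 * (sensitive - new_mutants)
        resistant = 2 * (resistant + new_mutants)
    return resistant

# 50 independent sublines: an early mutation yields a "jackpot" culture.
sublines = [resistant_count() for _ in range(50)]

# 50 replicate samples from one pooled mainline culture: pooled growth
# averages early and late mutations, so replicates differ only by
# sampling noise (modeled here as Poisson).
mainline_mean = sum(resistant_count() for _ in range(50)) / 50
replicates = rng.poisson(mainline_mean, size=50)

print("variance of sublines:  ", np.var(sublines))
print("variance of replicates:", np.var(replicates))
# Expected: subline variance >> replicate variance, the signature of
# random, spontaneous mutation rather than induced resistance.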

Use of inhibitors

A frequent approach in studying regulation of biologic processes such as cell proliferation is the use of inhibitors. Typically, inhibitors are used to identify a catalytic or regulatory activity that affects an end process. Many inhibitors are reversible and act conditionally, depending on the inhibitor's concentration and its duration of exposure to the biologic system. Inhibitor studies can be combined with biochemical and genetic approaches to determine the exact protein that interacts with the inhibitor and the contribution of the protein to the end process. A specific inhibitor should block the proposed process but not other closely related reactions; for example, an inhibitor of DNA synthesis should not also block RNA synthesis. Failure of an inhibitor to affect a process suggests that the reaction normally modified by the inhibitor is not involved. A common problem in studies using inhibitors is the uncertain specificity or, worse, the known lack of specificity of some inhibitors. In some instances, the specificity is dose-dependent (ie, at low doses, the inhibitor might be specific and affect only one process, but at higher doses, it might affect several). Thus, studies using inhibitors should use the lowest possible concentration that is effective.27


Tactical Considerations in Clinical Experiments

Clinical studies have a host of strategic and tactical considerations, many of which result from the interactions of patients and investigators and their state of knowledge and ethics. The investigator's intention also influences clinical research trial design; the investigator might strive to demonstrate efficacy (the ability of the intervention to produce effects in an ideal setting) or effectiveness (what the intervention accomplishes in actual practice). Spilker28 has reviewed many of these considerations in his Guide to Clinical Studies and Developing Protocols. Friedman et al29 discuss the fundamentals of clinical trials, with an emphasis on trials demonstrating effectiveness. Some of the key considerations follow.

Outcome measures

In clinical studies, the selection of the outcome measure is a key decision. Ideally, it should be easy to ascertain or measure accurately. The outcome measure should be clinically relevant and, ideally, should avoid the problems associated with surrogate variables. The investigators should be able to observe or measure the outcome in all treatment groups. The measure should be chosen before the experiment begins so that investigators may be spared the temptation of choosing the variable that best supports their hypothesis after data are collected. Moreover, the investigators have an ethical responsibility to monitor the safety and clinical benefit of an intervention during a trial; if the intervention shows a clear benefit, the trial can be stopped so that subjects in the control group can also benefit. Statisticians have devised stopping rules to determine the appropriate point at which a trial can be discontinued. These methods can entail repeated testing for significance that must adjust for the number of comparisons being made.29,30 Similarly, a clinical trial may have to be discontinued if harmful effects are noted in those receiving treatment or if the data are unlikely to show any significant effect. The choice of outcome measure will influence the statistical test used to analyze the results—and thus the number of subjects required to obtain suitable power. The formula for determining the number of subjects for a dichotomous response variable (ie, success or failure) differs from that used for continuous response variables (eg, see Friedman et al29).

The general rule is that a clinical trial should have sufficient statistical power to detect clinically interesting differences between groups. Clinical therapies in dentistry are often evaluated in terms of technical performance, such as quality of margins. Technical performance is appropriate in evaluating quality only to the extent that the technical indicator is related to the clinical outcome. Bader and Shugars31 state that the support for the relationship between some technical performance indicators (eg, amalgam polish) and adverse outcomes is often not well documented, for the outcomes of the treatments themselves are not well documented. The breadth and depth of knowledge vary across outcome categories, including such dimensions as physiologic (eg, pain, presence of pathologic states, assessment of function), psychologic (eg, esthetics, satisfaction), economic (costs), and longevity/survival (eg, pulp death, tooth loss, time until repeat treatment). Bader and Shugars31 conclude that dentistry is in an early period of the development of outcomes analysis and that most of dentistry's day-to-day procedures are rendered in the absence of comprehensive knowledge of their expected results.
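The dependence of sample size on the type of outcome can be made concrete with the standard normal-approximation formulas. In the minimal sketch below, the success rates, mean difference, standard deviation, and error levels are all invented for illustration; they are not recommendations.

from statistics import NormalDist

z_alpha = NormalDist().inv_cdf(1 - 0.05 / 2)   # two-sided alpha = .05 -> 1.96
z_beta = NormalDist().inv_cdf(0.80)            # power = .80 -> 0.84

# Dichotomous outcome (eg, restoration success/failure); assumed rates:
p1, p2 = 0.70, 0.85
n_binary = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

# Continuous outcome (eg, mean probing depth change); assumed values:
delta, sd = 0.5, 1.2
n_continuous = 2 * ((z_alpha + z_beta) * sd / delta) ** 2

print(f"per group, dichotomous outcome: {n_binary:.0f}")      # about 118
print(f"per group, continuous outcome:  {n_continuous:.0f}")  # about 90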

Blinding

Research has repeatedly found that the clinician's or patient's faith in the treatment (or in the physician) can result in a considerable alleviation of symptoms, even when the treatment is a placebo with no active ingredient. The term blind refers to a lack of knowledge of the study treatment. Blinding the investigator and the patient to the treatment is done to prevent the placebo effect from interfering with a study. In an open-label (unmasked) study, both the clinician and the patient are aware of the treatment. In a single-blind study, one of them (usually the patient) does not know the treatment, but the other (usually the clinician) does. In a double-blind study, neither clinician nor patient knows which treatment the patient received. In some cases, it is difficult to disguise the treatment; for example, any treatment that stains teeth would be obvious to both the clinician and the patient. Spilker28 cites instances where different results were obtained in single- and double-blind studies. For example, patients significantly improved their perceived taste acuity when treated with zinc sulfate in a single-blind study, but not in a double-blind crossover study.


One method of checking the effectiveness of the blinding is to ask patients and investigators which treatment the patient received. Clinicians and patients can be quite perceptive in identifying treatments. In a supposedly double-blind test of a beta-blocker in heart-attack patients, nearly 70% of physicians and more than 80% of patients correctly guessed whether the patients had received the drug or the placebo.32

Compliance

A compliant subject is one who is willing to carry out the procedures specified in the study protocol. Subjects may not wish to comply or may be incapable of complying for many reasons, including factors such as unpleasantness or complexity of the treatment, personality of the subject or the investigators, insufficient motivation, and lack of any discernible effect. Obviously, it is most efficient to enroll only subjects who are likely to be compliant into an efficacy study. Compliance can be assessed by techniques such as counting pills, directly observing and questioning patients, obtaining biologic samples at spot checks, and examining patient diaries kept for this purpose. In our study of the effects of chlorhexidine-soaked dental floss, my graduate student checked compliance by both self-reports and measurements of the actual floss used. Compliance was important to assess in that study because compliance is best obtained with simple interventions, and using floss properly seems difficult for many people. Compliance can be improved by simplifying the demands, minimizing unpleasant events for the patients, allowing frequent contact between the patient and his or her family, and employing other motivational methods.28

Subject recruitment and loss

Clinical studies have eligibility criteria that define the type of subject that is required. Obtaining eligible subjects and retaining them is a major problem for investigators, for both these factors can introduce bias into the study. In setting the eligibility criteria, an investigator must strike a balance between obtaining a homogenous population that will enhance the probability of seeing a result and a heterogenous group that will better represent the general population. In general, ease of recruitment is inversely proportional to strictness of the eligibility criteria, but some criteria cannot be relaxed.

Designing effective recruitment strategies requires some consideration of the target population's characteristics. Perri et al33 found that the placement of advertisements in senior citizen newsletters was the most cost-effective method of recruiting prosthodontic subjects from that group. Ethics dictates that clinical studies should employ subjects who have the potential to benefit from the intervention; conversely, those who would be harmed by the treatment must be excluded. Patients who are assigned to a group by a randomization procedure but fail to complete the study because they are unwilling or unable to participate are called dropouts. Withdrawals are patients who have entered the study but have been excluded because there is an assignable cause for removing them, such as noncompliance. Spilker28 has reviewed patient sources, factors that influence their recruitment, and techniques to increase recruitment. The most rigorous analysis, in which patients' results are analyzed in the groups to which they were assigned regardless of whether they complete treatment, is called intention-to-treat analysis. The intention-to-treat approach makes sense for effectiveness studies because it estimates what happens when an intervention is prescribed for a patient. However, for efficacy studies, the goal is to evaluate the intervention under ideal conditions, which would normally entail subjects actually receiving it.

Data quality

Much of the effort in performing clinical trials is spent ensuring that the data are collected in accordance with a protocol and are recorded accurately. Investigators must design data collection forms that capture standard information, such as the subject's name, schedule of treatments received, and values of measurements performed. In addition, the forms often allow for recording abnormal events or adverse reactions, as well as investigator comments. Any laboratory performing the measurements should have standard protocols and quality control measures to determine, for example, if values drift with time. Strategies to deal with unexpected values must be devised. Should measurements be repeated on samples that gave severely abnormal results? Should samples be tested at two separate laboratories and discrepant results checked? What standards should be used for the laboratory, and how frequently should instruments be calibrated? In addition, investigators must standardize procedures for clinical measurements; must prepare detailed instructions for subjects, clinicians, and support staff; and must follow protocols carefully.


Significant problems in data quality include (1) missing data, (2) inaccurate data, and (3) highly variable data. There are several ways of dealing with these problems; nevertheless, random external audits of clinical trial data29 have found problems with data quality, such as failure of investigators to comply with recruitment criteria (6%), deviations from treatment protocol (11%), and errors in recorded treatment responses (5%). Altogether, problems with data quality seem sufficient to interfere with research efficiency. Other strategic and tactical considerations for randomized controlled trials are discussed in chapter 20.

Typical Variables to Control or Consider in Biologic Research

Each scientific discipline has its own traditions about what variables must be closely controlled. For instance, even at the mundane level of washing glassware, differences exist. In general, chemists are meticulous in this regard, whereas microbiologists (at least before the advent of molecular biology) are typically more lax. Biologists might often be downright sloppy, unless performing cell culture, in which case they are ritualistic. These differences in standards of cleansing exist because the importance of clean glassware depends on the kind of experiment being done. Trace amounts of sulfur compounds could ruin an experiment on chemical catalysis but would likely have no effect on the growth of a hardy bacterium. Thus, for each type of study, there are variables that have been found to be important and that must be controlled. Knowing which variables are important is central to expertise in any discipline, and given the diversity of biologic experimentation, no one could provide an exhaustive list of these variables. Nevertheless, Ingle34 has outlined the major extrinsic and intrinsic variables that affect biologic and medical research; some of these are mentioned briefly here:

1. Genetics. Breeds and strains of animals differ in metabolic behavior in ways that can be crucial to investigators. In general, rodents seem to require the same vitamins as humans; but, unlike humans, rats synthesize their own vitamin C (do not attempt to make rats scorbutic by placing them on an ascorbic acid–deficient diet). An entire industry has developed for the production and distribution of inbred strains of animals with specialized features.
2. Sex. Sex differences have been found in metabolism, responsiveness to hormones and pharmacologic agents, susceptibility to pathologic changes, and longevity.
3. Age. The sensitivity of animals to hormones, food factors, and drugs varies with age.
4. Activity. The voluntary activity of laboratory animals and of humans varies greatly among individuals and may be cyclic in the same individual. (Thus, it is said that civil servants do not look out the windows in the morning because that is all they do in the afternoon.) Muscular activity correlates with age up to sexual maturity and begins to decline when growth has stopped.
5. Emotionality. Ingle34 states that in humans, correlations between emotional behavior and metabolic changes are abundant, but cause-effect relationships have been difficult to establish.
6. Microbiologic pattern. Dentists hardly need any reminder that the pathogenesis of periodontal disease and dental caries depends on the presence of certain microorganisms. Bacteria can also exist in a symbiotic relationship with animals—a relationship that might be adversely affected by antibiotics. There is growing evidence that latent viruses are involved in many diseases. These considerations complicate the interpretation of the responses observed to agents that might exert their influence indirectly by affecting the microbiologic pattern.
7. Environment. Environmental factors that must be controlled include light, temperature, ventilation, bedding materials, population density, juxtaposition of different species, and cage design. Like hospitals, animal-care facilities are now required to be accredited, a process that involves a site visit and a minute examination of conditions.
8. Diet. Navia35 states that a common error in the design of experiments involving animals is the use of diets that are nutritionally and organoleptically inadequate. In such experiments, treatments are applied to animals that are not healthy; consequently, the results will be confounded by the presence of nutritional diseases, which may independently affect the parameters being measured.


References

1. Bacon R. Opus Maius. Part IV: Mathematics in the Science of Theology. 1267.
2. Defense.gov. News Transcript: DoD News Briefing—Secretary Rumsfeld and Gen. Myers, United States Department of Defense. 12 February 2002.
3. He will be eternally lost in his hopeless oblivion! https://nafshordi.wordpress.com/2016/07/26/he-will-be-eternally-lost-in-his-hopeless-oblivion/. Accessed 1 March 2019.
4. Harré R. Great Scientific Experiments. Oxford: Oxford University, 1983.
5. Atkinson HF, Ralph WJ. In vitro strength of the human periodontal ligament. J Dent Res 1977;56:48–52.
6. Manzi J. Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics and Society. New York: Basic Books, 2012.
7. Neuhauser D, Diaz M. Heroes and martyrs of quality and safety. Daniel: Using the Bible to teach quality improvement methods. Qual Saf Health Care 2004;13:153–155.
8. Chalmers I. Comparing like with like: Some historical milestones in the evolution of methods to create unbiased comparison groups in therapeutic experiments. Int J Epidemiol 2001;30:1156–1164.
9. Manzi J. City Journal Magazine. What social science does—and doesn’t—know. https://www.city-journal.org/html/what-socialscience-does—and-doesn’t—know-13297.html. Published 2010. Accessed 7 June 2018.
10. Plutchik R. Foundation of Experimental Research. New York: Harper and Row, 1974.
11. Cox DR. Planning of Experiments. New York: Wiley, 1958.
12. Heath OVS. Investigation by Experiment, no. 23, Institute of Biology’s Studies in Biology Series. London: Edward Arnold, 1970.
13. Wagner RW. Gene inhibition using antisense oligodeoxynucleotides. Nature 1994;372:333–335.
14. Ross OB. Use of controls in medical research. JAMA 1951;145:72.
15. Patterson HR. Controls in clinical studies. Lancet 1962;1:90.
16. Mahon WA, Daniel EE. A method for the assessment of reports of drug trials. Can Med Assoc J 1964;90:565–569.
17. Spilker B. Guide to the Clinical Interpretation of Data. New York: Raven, 1986.

18. Meadows AJ. Communication in Science. Toronto: Butterworths, 1974.
19. Medawar PB. The Limits of Science. Oxford: Oxford University, 1984.
20. Root-Bernstein RS. Discovering: Inventing and Solving Problems at the Frontiers of Scientific Knowledge. Cambridge: Harvard University, 1989.
21. Medawar PB. Induction and Intuition in Scientific Thought. London: Methuen, 1969.
22. Beveridge WIB. The Art of Scientific Investigation. New York: Vintage, 1950.
23. Beauchamp GK. The flavor of serendipity. American Scientist 2019;107:170.
24. Conant JB. On Understanding Science. New York: Mentor, 1951.
25. Luria SE, Delbruck M. Mutations of bacteria from virus sensitivity to virus resistance. Genetics 1943;28:491–511.
26. Luria SE. A Slot Machine, a Broken Test Tube. New York: Harper and Row, 1989.
27. Pardee AB, Keyomarsi K. Modification of cell proliferation with inhibitors. Curr Opin Cell Biol 1992;4:186–191.
28. Spilker B. Guide to Clinical Studies and Developing Protocols. New York: Raven, 1984.
29. Friedman LM, Furberg CD, DeMets DL. Fundamentals of Clinical Trials, ed 3. New York: Springer, 1998.
30. Elwood JM. Critical Appraisal of Epidemiological Studies and Clinical Trials, ed 2. Oxford: Oxford University, 1998.
31. Bader JD, Shugars DA. Variation, treatment outcomes, and practice guidelines in dental practice. J Dent Educ 1995;59:61–95.
32. Byington RP, Curb JD, Mattson ME. Assessment of double-blindness at the conclusion of the beta-blocker heart attack trial. JAMA 1985;253:1733–1736.
33. Perri R, Wollin S, Drolet N, Mai S, Awad M, Feine J. Monitoring recruitment success and cost in a randomized clinical trial. Eur J Prosthodont Restor Dent 2006;14:126–130.
34. Ingle DJ. Principles of Research in Biology and Medicine. Philadelphia: Lippincott, 1958.
35. Navia JM. Animal Models in Dental Research. Birmingham: University of Alabama, 1977.


20 Experiment Design

“But for the moment, let us stop with the idea that the scientist cannot just go off and ʻexperiment.ʼ It takes some long and careful thought—and often a strong dose of difficult mathematics.”
DAVID SALSBURG1

In designing experiments, investigators typically make decisions about five major components: (1) treatments or interventions, which are generally the research interests driving the experiment; (2) experimental units, which might be teeth, mouths, animals, or plots in a field; (3) population/sample, which defines the group about which the investigators want to generalize and from which they must sample; (4) allocation, which is the manner in which the treatments are allocated to experimental units (usually by a specified randomization procedure); and (5) outcome, which comprises the counts or measurements to be made on the experimental units.

The precision of an experiment, like the precision of a measurement, indicates the closeness with which the experiment estimates some quantity; it is measured as the reciprocal of the variance per observation.2 If an experiment were changed so that the diet of the subjects was no longer controlled, the variance in the amount of bacterial plaque found in the subjects’ mouths might increase twofold. In that event, twice as many subjects as were used in the experiment in which diet was controlled would be required to estimate the mean for the treatment with the same variance. The efficiency of experiment designs may be considered in terms of the number of experimental units required to gain a certain precision. The relative efficiencies of two experiment designs are given by the ratios between their precisions.2 Because most investigations operate under financial constraints, efficiency is an important concept because it relates directly to the cost of the experiment. The greater number of experimental units used in a less efficient design will add to the cost. Experiment design is a complex subject that has received comprehensive treatment in numerous textbooks.2–6 Only some of the simpler designs and more essential issues will be considered here.
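A minimal sketch makes the precision and efficiency definitions concrete; the variance values below are invented to mirror the plaque example.

# Sketch: precision = 1 / (variance per observation); the relative
# efficiency of two designs is the ratio of their precisions.
# The variance values are invented for illustration.

var_diet_controlled = 0.8     # plaque variance per observation, diet controlled
var_diet_uncontrolled = 1.6   # variance doubles when diet is not controlled

precision_controlled = 1 / var_diet_controlled
precision_uncontrolled = 1 / var_diet_uncontrolled

relative_efficiency = precision_controlled / precision_uncontrolled
print(relative_efficiency)    # 2.0: the less efficient design needs roughly
                              # twice as many subjects for the same precision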

Managing Error

A fundamental goal of experiment design is to minimize the effects of error. In dealing with error, the investigator has four basic options: (1) avoid the error, (2) distribute the error, (3) measure the error, or (4) make the error relatively small.


Avoiding sources of error

Avoiding the sources of error is the usual approach in the physical sciences and is most often accomplished by identifying relevant variables and improving methods to bring the variables under control.

Distributing error

If the source of error cannot be eliminated, it may still be possible to arrange the experiment so that the particular error is distributed equally among the groups in the experiment. This principle was first applied in agricultural science to divide fields and partition the treatments into different areas. Indeed, much of the terminology used in experiment design reflects this agricultural heritage (eg, experimental units are called plots). Several of the designs described in this chapter are effective because of the manner in which they distribute the error.

Measuring error

Some errors are unavoidable and cannot be distributed. In such cases, the investigator must measure the error and report it or correct for it.

Example 1: Measurement of absorbance

Among the first scientific laws formulated were the laws describing the absorption of light by solutions. Using a spectrophotometer to measure the amount of light a solution absorbs enables an experimenter to estimate the concentration of a chemical within that solution. This measurement is common in biologic and chemical laboratories, and the various problems associated with it are well recognized. In brief, the amount of light absorbed by a sample of solution contained in a cuvette is measured with a photocell, according to the Beer-Lambert law. This law states that absorbance A = abc, where a is a constant for the material in solution, b is the path length in the cuvette, and c is the concentration. The concentration c can be calculated after measuring A (since a and b are known). One possible error is that some light could reflect off the walls of the cuvette and other interfaces and never reach the photocell. It is common practice to correct for this error by measuring it. In this case, the correction involves measuring the absorbance of a cuvette filled with water. This control is called a blank. The absorbance of the sample solution is then calculated as: A = A_sample – A_blank = abc. A second problem occurs when another molecule that interferes with the measurement is also in the cuvette. For example, in the Lowry method for measuring protein, the protein is reacted with Folin reagent in the presence of a buffer containing Cu2+ ions. However, the Folin reagent absorbs some light. In this instance, the blank is not distilled water but the reaction mixture without added protein to control for the absorption of the Folin reagent. Another issue concerns the varying apparatuses used to measure absorbance; there are different kinds of spectrophotometers, which have different advantages. Dual-beam models are used to get around the problem of instability of the light source. Although such variability would not be a serious problem for routine measurements, it might be a significant source of error if very small differences were measured. In any case, the purpose of this example is to show that measuring or correcting for the error is a common and useful procedure that can be applied at various levels of sophistication.
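A minimal numeric sketch of the blank correction follows; the absorptivity, path length, and absorbance readings are all invented for illustration.

# Sketch of a blank-corrected Beer-Lambert calculation: A = a * b * c.
# All numbers are invented for illustration.

a = 6.22          # absorptivity constant for the material (L/(mmol*cm))
b = 1.0           # cuvette path length (cm)

A_blank = 0.05    # absorbance of cuvette filled with water (stray losses)
A_sample = 0.67   # raw absorbance reading of the sample

A = A_sample - A_blank          # corrected absorbance
c = A / (a * b)                 # concentration from the Beer-Lambert law
print(f"c = {c:.4f} mmol/L")    # c = 0.0997 mmol/L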

Example 2: Correction via internal controls or standards

In the previous example, adjustments were made to the system from the outside (ie, by using a blank or a dual-beam instrument that could be manipulated outside the solution itself). Sometimes this is not enough, and an internal standard or control is used. One problem encountered in measuring radioactivity by liquid scintillation is quenching. When quenching occurs, some of the bursts of light that would normally be produced and counted are blocked by interfering substances or conditions. Ideally, the experimenter would remove the interference, but often this is not possible because the sample itself is responsible for the quenching. To circumvent this problem, a known amount of radioactivity is added to the sample. We can then calculate what the new (increased) count should be and compare it with what the count actually is. Suppose the observed counts per minute (cpm) of a sample were 200. One could add 100 cpm of a radioactive standard to the sample. The expected value would equal the sample plus the radioactive standard (ie, 200 + 100 = 300 cpm). If the new observed value were 250 cpm, then 50 cpm were lost; that is, 50% of the counts of the radioactive standard were lost. Thus, it seems likely that 50% of the counts in the sample were also lost when the original sample was counted.


Because one half of the counts was lost, the original value for the sample should be multiplied by 2. The corrected value would be 200 cpm × 2 = 400 cpm.
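The same correction can be written as a two-line calculation; this minimal sketch simply generalizes the arithmetic of the example to any recovery fraction.

# Sketch of internal-standard quench correction, using the counts
# from the example above.

observed_sample = 200.0    # cpm measured for the quenched sample
spike_added = 100.0        # cpm of radioactive standard added
observed_spiked = 250.0    # cpm measured after adding the standard

recovered = observed_spiked - observed_sample   # 50 of 100 cpm detected
recovery_fraction = recovered / spike_added     # 0.5

corrected = observed_sample / recovery_fraction
print(corrected)   # 400.0 cpm, as in the worked example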

Example 3: The reconstruction experiment

Investigators often wish to demonstrate that the effect of a particular treatment or procedure is negligible. The lack of an effect can sometimes be demonstrated by means of a reconstruction experiment. To test various cavity cleansers, Vlietstra et al7 devised a technique whereby a broth containing Micrococcus luteus was dripped into the cavity during preparation. The cavity was then treated (eg, with Cavilax), and the investigators attempted to recover M luteus from the cavity. However, they discovered a potential problem. During the cavity preparation, an aerosol was created that could recontaminate cavities to give false-positive results on their test. To estimate the magnitude of this potential problem, they did a reconstruction experiment whereby they prepared cavities without dripping the M luteus into them during preparation. These cavities were exposed to the atmosphere in the same manner as the treated groups. Because none of these controls was found to be contaminated, this reconstruction experiment was able to rule out the possibility that those treated cavities that contained M luteus were contaminated after treatment. Reconstruction experiments of this type are extremely useful in dealing with unavoidable errors due to observation or treatment.

Making error relatively small

Error can be made relatively small in two ways: (1) increasing the potential difference between treated and control groups and (2) making the error smaller.

Increasing the potential difference between treated and control groups

When this strategy is adopted, the investigator tries to perform the experiment under conditions that are maximally favorable for demonstrating a given effect. It is one of the oldest and most effective techniques in the experimenter’s arsenal. To take a historical example, one reason advanced for Antoine Lavoisier’s success in 1775 in discovering the role of oxygen in combustion was that he used materials (sulfur and phosphorus) that demonstrated a prodigious (to use Lavoisier’s term) increase in weight on burning. Other investigators used metals, such as tin, which increased only by about 25%.

With the crude instruments available at the time, Lavoisier could detect the large difference in weight after burning phosphorus, whereas others could not detect the increase in weight of the metals clearly enough to discount the phlogiston theory.8 In biologic research, the potential difference between treated and control groups can often be increased by using a high-risk population. This stratagem ensures that something will likely happen. In testing the effectiveness of oral hygiene procedures on gingival indices, the population under study is often ordered not to brush their teeth for some period of time prior to and/or during the experiment. This ensures that the subjects will have plaque on which the test procedure can work. If a study tested a mouthrinse on dental students (who often volunteer for experiments of this type) but gave no instructions to avoid tooth brushing, the chances of seeing any effect would be small because the students’ oral hygiene would inhibit plaque formation. The same principle applies to other instances in which investigators select a population that is likely not only to respond to whatever treatment is applied but to respond maximally. The idea behind such a strategy is to increase the signal-to-noise ratio. Thus, if the error remains the same, it is easier to distinguish the effect of a treatment from the error by using a high-risk population that increases the size of the response. Chilton5 notes that, because dental caries formation is most active during childhood and adolescence, the best trial would be run in a group in which these ages are well represented, rather than in an older population in which the caries process is no longer as active. However, the lowest age group studied should be one in which a sufficient number of permanent teeth are present to give a large enough susceptible tooth population. Too many primary teeth in the population makes for added difficulty because of the variability associated with the exfoliation of teeth. In an experiment on the effect of diet on rat caries, the investigator feeds the animals some agent with potential anticariogenic activity, such as xylitol, and measures the number of caries lesions over time. The problem is that with a normal diet, few caries lesions might develop. Thus, the rats are fed a diet containing 20% sucrose and are inoculated with the caries-producing Streptococcus mutans. In this way, an investigator can induce the caries process in 15 of the 16 fissures at risk, and the effectiveness of any treatment can be more readily demonstrated.


Making the error smaller

The error can be made smaller in several ways. For example, the method can be improved (precise and accurate methods of measurement produce smaller errors and increase the power of an experiment), and the homogeneity of the population can be increased. Homogeneity might be achieved by using experimental units of the same age, sex, genetic background, subculture, state of oral hygiene, or other factors. This will eliminate variability in the observations that arises because of systematic differences in the sample attributable to these factors. However, this method limits the generality of the conclusions. Increasing the sample size reduces the standard error (SE); the SE is inversely proportional to the square root of the sample size. Thus, to reduce the SE by a factor of three, we have to increase the sample size nine times. Increasing homogeneity and sample size are simple means to make the error smaller, but sometimes more complex methods are needed to measure and/or distribute the error.
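The square-root relationship is easily verified numerically; in this minimal sketch the population SD is simply assumed to be 1.

# Sketch: SE of the mean = SD / sqrt(n), so 9x the sample size
# gives one third the SE. The population SD is assumed to be 1.
import math

sd = 1.0
for n in [10, 90, 810]:
    se = sd / math.sqrt(n)
    print(f"n = {n:4d}  SE = {se:.3f}")
# n =   10  SE = 0.316
# n =   90  SE = 0.105  (about one third of 0.316)
# n =  810  SE = 0.035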

Some Common Experiment Designs

Randomized controlled trial

The randomized controlled trial (RCT) is based on Mill’s method of difference and is recognized as a strong research design. In an RCT, a group of experimental units (patients, animals, or cultures) are randomly allocated into two or more groups. The experimental groups receive treatment, while the control group gets nothing (negative control), conventional treatment (active control), or a placebo. In a parallel trial, the groups are treated separately but concurrently as part of the same study. Sometimes, an attempt is made to minimize biases by having neither the subject nor the investigator know which group received experimental treatment until the trial is over (a double-blind trial).

Randomization

Randomization is the control technique whereby each subject or object to be measured has an equal probability of being included in a given treatment group. Therefore, the groups to be compared should, on a statistical basis, be the same at the beginning of the experiment so that any differences observed at the end of the experiment are the result of some treatment.

A second reason for randomization is that it removes investigator bias in assigning participants to groups. Randomization is not necessarily easy and is not synonymous with haphazard selection, for several reasons outlined by Mainland,6 including the possibility of the purposeful, although well-intentioned, steering of patients to a favored treatment. Nor is it synonymous with alternate assignment (ie, Patient 1 gets treatment; Patient 2 is a control; Patient 3 gets treatment; etc). In general, the use of an alternate-subject method of assignment risks confounding sequence and treatment.6 It has been said that nearly every human being has a tendency away from true randomness when making choices. Moreover, such procedures as the “every nth name” technique, which appear to produce random samples, are not reliable, because alphabetical lists contain clusters arising from racial origins, blood relationships, and even marriage. Current standards of experiment design demand that the method used for randomization be reported. Once the sample groups have been chosen, some characteristic is measured. Because (at least theoretically) the groups are nearly identical in every other variable prior to treatment, it is concluded that any difference in the treated group is caused by the treatment. Randomization commonly occurs in two basic forms. Stratified randomization entails classifying the subjects on some variable and dividing the subjects into strata (eg, high, medium, and low values for the measurement). Then, for each level, subjects are randomly allocated into the treated and control groups. For many purposes, this is the equivalent of blocking (discussed later in this chapter). Another form of randomized sample selection is adaptive sampling, which can take several forms and can introduce other considerations, such as balancing the numbers in the groups, baseline characteristics, or participant response to a given intervention. However, for all of these considerations, it remains true that a randomizing procedure will be involved in selecting which subjects go to which groups. Along with its virtues, randomization has its problems as a guarantor of comparisons free of systematic error, which might result because the groups are not equivalent in some important property that helps determine outcome. Chalmers et al9 studied how such determinants are partitioned between groups and concluded that at least one determinant was significantly maldistributed in 14% of trials. The possibility of such maldistribution is highest when selections are made from small heterogenous groups.


Blair10 discusses several ways to deal with this issue; in particular, minimization is an allocation strategy that seeks to minimize differences in the distributions of outcome determinants between groups.
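As a purely illustrative sketch of one defensible procedure, the code below performs stratified, permuted-block randomization; the strata and the block size are arbitrary choices made for the example, not a prescription from the text.

# Sketch of stratified, permuted-block randomization: within each
# stratum, every block of 4 contains 2 treated and 2 control slots in
# random order, keeping the groups balanced as subjects accrue.
import random

random.seed(42)  # in a real trial, the randomization method must be reported

def allocation_sequence(n_subjects, block_size=4):
    seq = []
    while len(seq) < n_subjects:
        block = ["treated"] * (block_size // 2) + ["control"] * (block_size // 2)
        random.shuffle(block)       # random order within each block
        seq.extend(block)
    return seq[:n_subjects]

for stratum in ["male", "female"]:  # blocking variable, as in Fig 20-1
    print(stratum, allocation_sequence(8))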

Bivalent versus functional experiments

In a bivalent experiment, only two conditions are used. The t test is the usual statistical test employed in comparing the two groups. The main danger is overgeneralization, because bivalent experiments test a treatment under only very limited conditions. It is possible for the treatment to be ineffective at the particular level or time chosen but effective at higher (or lower) concentrations or at other times. Bivalent experiments do not demonstrate a pattern in the results, nor do they necessarily test the treatment under optimal conditions. In functional experiments, one variable (the independent variable) is given three or more values, and the characteristic of interest (dependent variable) is measured at each level on randomly selected groups. The results of the measurement versus the value of the independent variable are plotted. Functional experiments have several advantages:

1. They test for the effect of the independent variable over a wider range of values.
2. They enable relationships between the variables to be seen. This is particularly important if there is a theory that explains the relationship. In those cases where a hypothesis predicts a certain mathematical relationship between the variables, it is important to remember that, simply because a group of data points fits a certain curve, this curve is not necessarily the only one that suits the data. There are probably other mathematical formulae that would also give a good fit. Gilbert11 notes the following:

There are no rules for choosing curves. Choice of an appropriate formula is largely a matter of experience. Where the data show a steady (even though nonlinear) progression, it is usually quite easy to find several different curves, and different mathematicians tend to favor different types of curve. Provided there is no theoretical reason to prefer one over another and provided we do not want to extrapolate along the curve beyond the range of operations, any one of these curves may be adopted.

Thus, before accepting a well-fitted curve as evidence for a particular hypothesis, we must be sure (as always) that an alternative hypothesis does not predict a formula that also fits the data.

3. They enable interpolation (ie, estimation of the values of the dependent variable between values of the independent variable). Extrapolation (ie, estimation beyond the range of the independent variable) is also more accurate than in a bivalent experiment, but this procedure is obviously subject to more error than interpolation.
4. I believe that, like a house of cards—in which the group is more solid than any individual card—several data points demonstrating a pattern of response are much more believable than the evidence afforded by a single point. However, the appropriate statistical tests of the data must be made before the relationship can be deemed significant.
5. There is a problem when the points on the curve are sampled from a potentially heterogenous population. In one part of my thesis research, I examined the effect of an inhibitor on an enzyme found in tissue-culture cells. I found a maximum of 80% inhibition and concluded that this was the most that this enzyme could be inhibited. To my embarrassment, it was pointed out at my oral examination that it was possible that the cells were heterogenous. Thus, my data could be explained by supposing that in 80% of the cells, the enzyme was completely inhibited and in the other 20% of the cells, completely uninhibited. The points on many published curves are, in reality, the averages of samples taken from populations that may be heterogenous. For example, a problem in measuring the growth of children is that there is a change in the rate of change (an acceleration or a deceleration), because different individuals spurt or slow down at different distances from the starting point. The method of averaging can create false estimates of the time and size of growth spurts in children.12

A major practical limitation with functional experiments is that often only one variable can be changed at a time. This, it shall be seen, can be a very inefficient way of finding information. The data are generally analyzed with regression analysis, using the least squares method.
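Gilbert's point about curve choice can be illustrated with least squares directly; the data points below are invented, and the example merely shows that two different formulas can fit the same data almost equally well.

# Sketch: two different curves fit the same invented data nearly
# equally well; fit alone cannot identify the "true" formula.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # roughly linear, with noise

linear = np.polyfit(x, y, 1)      # least-squares straight line
quadratic = np.polyfit(x, y, 2)   # least-squares parabola

for name, coeffs in [("linear", linear), ("quadratic", quadratic)]:
    residuals = y - np.polyval(coeffs, x)
    print(f"{name:9s} sum of squared residuals: {np.sum(residuals**2):.3f}")
# Both residual sums are small; extrapolating either curve beyond
# x = 5 would be hazardous, as the text warns.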


Matched-group designs: Overcoming differences in relevant variables

Matching

When the number of subjects is small and the variation of the characteristic under consideration is large, randomization may not yield truly equivalent groups. Matching techniques perform a dual function of reducing the error and, at the same time, controlling the variable on which the subjects or objects are matched. One problem with the random-groups design is that it is possible, because of random sampling fluctuations and heterogenous populations, that the two groups are not as similar as would be desirable. One way to avoid this problem is to match the groups. For example, if we were testing the effect of some agent on bone growth in rat calvaria in organ culture, we would cut the calvarium in two, and one half would serve as a control while the other half would be treated. This procedure would be repeated for each of the rat embryos. In this way, one variable (size of a half calvarium, which exhibits much less variability within a rat than within a litter) could be controlled. Matching enables the experimenter to reliably detect differences between groups with a smaller number in the experimental groups than would be needed in random-groups designs. However, the matched-group design gains its sensitivity to small effects at the expense of the generality of the conclusions; one can only guess whether the effect is large enough to be observed in randomly selected heterogenous populations.

Pairing

The power of pairing is evident from an example, cited by Beveridge,13 in which growth rates of calves were studied; identical twins were about 25 times more useful than ordinary calves. Pairing is not always useful; one of van Belle’s14 statistical rules of thumb is that one should not pair data unless the correlation between the pairs is greater than 0.05.

Blocking: Building the extraneous variable into research design

Blocking is the practice whereby groups of experimental units that have some property in common (such as males, females, samples obtained from one patient, or littermates) are made up in advance, and members of the block are assigned randomly to the treatments.

Blocking might be efficiently used in a test of a drug that was applied to both males and females (Fig 20-1). The test population would be divided into male and female blocks, and members of each block would be randomly assigned to the treated and control groups. In this way you could control the variation (because for many kinds of measurements, there would be large differences between males and females), but you could also find out whether there was a difference in male and female response to the drug. Blocking makes the experiment more efficient because the effects of variation are reduced. In studies on rat caries, Marthaler15 found a significant litter effect in 13 of 16 experiments. In other words, some litters were more or less caries-prone than others. To avoid this source of variation, we could arrange matters so that each treatment is applied to one member of each of a number of litters. By blocking the litters in this way, the investigators found that the efficiency of the blocked experiment was 143% relative to 100% for a completely randomized design.

Yoked controls

Another way of managing an extraneous variable is to incorporate a suitable control into the experiment design. Consider Fig 20-2, which shows the influence of ascorbic acid on weight gain in guinea pigs. Since vitamin C would usually be in the diet of these animals, the group having the diet with vitamin C would be considered the control. The experimental group would be vitamin C deficient, and these animals would not gain weight in the same manner as the controls. However, most nutrient deficiencies cause a state of inanition, and the animal in the experimental group progressively restricts its intake of the deficient diet. Therefore, the experimental animals would be deficient not only in vitamin C but also in other nutrients. As explained by Alfano,16 this problem may be solved by including a pair-fed control—that is, animals receiving the vitamin C diet would be allowed to eat only the amounts consumed by their counterparts on the deficient diet. The result is shown in Fig 20-3. Although vitamin C still has an effect, it is not as large as would be inferred from Fig 20-2. The pair-fed control is an example of a yoked control. In this example, a relevant variable (total food consumption) affecting weight gain has been controlled.


Fig 20-1 Illustration of blocking technique to reduce variation and determine if a drug affects males and females differently.

Fig 20-2 Influence of ascorbic acid on weight gain in guinea pigs: weight (g) plotted against days on diet for deficient and ascorbic acid–supplemented groups. (Reprinted from Alfano16 with permission.)

Fig 20-3 Addition of pair-fed control to the study in Fig 20-2. (Reprinted from Alfano16 with permission.)


Table 20-1 | Ulcerogenicity scores displayed by treatment group

Group    Aspirin dose (mg/kg)    Indomethacin dose (mg/kg)    Ulcerogenicity score
A        0                       0                            0
B        250                     0                            6
C        0                       10                           23
D        250                     10                           71

Data from Shaw and Wischmeier.19

Table 20-2 | Ulcerogenicity scores displayed in factorial design format

Aspirin dose (mg/kg)    Indomethacin 0 mg/kg    Indomethacin 10 mg/kg
0                       0 (A)                   23 (C)
250                     6 (B)                   71 (D)

Data from Shaw and Wischmeier.19

Factorial design

A dogma for some scientists is that the only legitimate way of doing experiments is to vary one factor at a time. This dogma has proved to be false. Fisher17 stated the following:

No aphorism is more frequently repeated in connection with field trials than that we must ask Nature few questions or ideally one question at a time. The writer is convinced that this view is wholly mistaken. Nature, he suggests, will best respond to a logical and carefully thought-out questionnaire. Indeed, if we ask her a single question she will often refuse to answer until some other topic has been discussed.

The need to vary several factors at a time is particularly important in areas of research where time is an important factor, such as agriculture, where there may be only one growing season per year. Accordingly, Fisher designed experiments that resembled carefully thought-out questionnaires. These factorial designs have the following advantages:

1. It is possible to obtain information on the average effects of all of the factors economically in a single experiment of moderate size.
2. The design broadens the basis of inference on one factor by testing it under varied conditions of others.
3. It is possible to study the interactions of the factors.18

Suppose we were investigating the production of gastric mucosal damage in rats that were being given aspirin and indomethacin, separately and in combination. The data given here (adapted from Shaw and Wischmeier19) are expressed as ulcerogenicity scores (Table 20-1). The data could also be arranged as shown in Table 20-2.

First, we will consider the effect of aspirin. Two comparisons can be made: (1) the effect of aspirin by itself (ie, at 0 dose of indomethacin), which involves comparing groups B and A (ie, 6 versus 0), and (2) the effect of aspirin with 10 mg/kg indomethacin, which involves comparing groups D and C (ie, 71 versus 23). Note that in each of these two comparisons, the only variable that differs between the groups is the presence of aspirin. The average effect for aspirin can be calculated as follows:

Aspirin effect = ½ [(B – A) + (D – C)] = ½ [(6 – 0) + (71 – 23)] = 27

Similarly, the effect of indomethacin can be calculated as follows:

Indomethacin effect = ½ [(C – A) + (D – B)] = ½ [(23 – 0) + (71 – 6)] = 44

The two effects calculated above are the so-called primary effects. If there were no interaction (discussed later) between the variables, a model of the data could be developed in which each observed data point could be constructed as follows:

Y (data point) = M (mean) + effect of aspirin + effect of indomethacin + error

The idea behind the model is that aspirin and indomethacin contribute to changing the observed data from the overall mean value. For group C:

YC = 23 = 25 + (–13.5) + (+22) + [error]


Table 20-3 | Interaction effect of aspirin and indomethacin on ulcerogenicity
(Rows: aspirin 0 and 250 mg/kg; columns: indomethacin 0 and 10 mg/kg)

Data         Mean        Aspirin            Indomethacin     Interaction
 0    23     25   25     –13.5   –13.5      –22    +22       +10.5   –10.5
 6    71     25   25     +13.5   +13.5      –22    +22       –10.5   +10.5

The mean for all 4 data points equals 25. The effect of aspirin (27) is partitioned as –13.5 to the groups that do not receive aspirin, such as group C, and +13.5 to those that do. The effect of indomethacin (44) is partitioned as –22 to the groups that do not receive indomethacin and +22 to those that do receive indomethacin, such as group C.

From this we can calculate that the error term equals –10.5. However, our model did not consider that there might be an interaction between the variables. The interaction effect can be considered the question of whether the increase in ulcers in rats treated with indomethacin depended on the presence of aspirin. The interaction is calculated as follows:

Interaction = ½ [(D – B) – (C – A)] = ½ [(71 – 6) – (23 – 0)] = 21

The interaction effect would be equally partitioned among all four points. To incorporate this idea, the model would be revised as follows:

YC = 23 = (mean + aspirin effect + indomethacin effect + interaction) + error = 25 – 13.5 + 22 – 10.5 = 23

Thus, after considering the interaction, the error term equals 0. (Note that because there are no replications in this data set, error cannot be estimated.) Overall, the model for this experiment can be represented as shown in Table 20-3. This model indicates how the data were produced (ie, why the data points are what they are). Next we must ask: How are the data analyzed? Is the effect of aspirin or indomethacin significant? The latter question cannot be answered for the example data set because there are no replications.
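The decomposition can be verified mechanically; this minimal sketch simply recomputes the effects from the four scores in Table 20-1.

# Sketch: main effects and interaction for the 2x2 factorial data in
# Table 20-1 (groups A, B, C, D).
A, B, C, D = 0, 6, 23, 71   # ulcerogenicity scores

mean = (A + B + C + D) / 4                        # 25.0
aspirin_effect = ((B - A) + (D - C)) / 2          # 27.0
indomethacin_effect = ((C - A) + (D - B)) / 2     # 44.0
interaction = ((D - B) - (C - A)) / 2             # 21.0

# Rebuild group C from the partitioned effects (half of each effect,
# signed by whether the group received the factor):
C_model = mean - aspirin_effect / 2 + indomethacin_effect / 2 - interaction / 2
print(C_model)   # 23.0, reproducing the observed score with zero error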

If there were replications—ie, more than one data point per cell—the analysis would be accomplished by analysis of variance (ANOVA). For factorial designs, this can be quite complex; however, most statistics packages for microcomputers are able to perform such analyses. Note that, generally, when we include the interaction term, the value estimated for the error drops. In ANOVA, the reduction of the error has an important consequence—it becomes easier to demonstrate a significant effect (because the mean square of the error is reduced). Because most research workers are interested in obtaining significant effects, this technique appears to be the royal road to happiness, but, like many good things, there is a catch. If an interaction proves significant, one must modify the conclusion concerning the two primary effects. It is possible that these primary effects are significant only because of the contribution of the data points where the interaction occurred (ie, the primary effects are only effective when they are applied together). This can be tested by another experiment—or in some cases by a reevaluation of the data, if they can be combined so that the points where the interaction occurred are not considered. In summary, the conclusions regarding the main effects have to be modified in such a way as to make them less general either by stating that there exists some subgroup within the experiment where the primary effect was greater than other subgroups, or by stating that the primary effect was not found at all in some subgroups. Complex factorial designs are being used in biomedical research, but comparatively rarely. This is unfortunate given the method’s power. When we plot the data from a factorial design (Figs 20-4 and 20-5), we can get an indication of whether an interaction occurs. If no interaction occurs (see Fig 20-5), the lines are parallel. However, in the paper by Shaw and Wischmeier19 there does seem to be some interaction between indomethacin and aspirin because there is a marked difference in slope (see Fig 20-4). There is much more ulcerogenicity in the presence of aspirin than in its absence. Thus, the authors concluded that “the combined use of these nonsteroidal anti-inflammatory agents has high ulcerogenic potential at doses of each not heretofore considered as such.”


Fig 20-4 Data from factorial design (ulcerogenicity versus aspirin dose, at 0 and 10 mg/kg indomethacin), indicating an interaction between indomethacin and aspirin.

Fig 20-5 Factorial design data (ulcerogenicity versus aspirin dose, with and without drug X), indicating no interaction between drug X and aspirin. Lines are parallel.

Table 20-4 | Example of crossover design

                                    Group 1 (G1):            Group 2 (G2):            Difference,
                                    Proxygel then placebo    placebo then Proxygel    Proxygel – placebo
Measured at end of period 1 (P1)    2.44 (YP1G1)             3.17 (YP1G2)             –0.73
Measured at end of period 2 (P2)    2.45 (YP2G1)             1.86 (YP2G2)             –0.59
Difference P1 – P2                  –0.01                    +1.31

Adapted from Zinner et al20 with permission.

Counterbalanced designs to reduce problems of individual variations

In counterbalanced designs, each subject receives each treatment, but one subject or group is tested in one sequence of conditions, while another subject or group is tested in a different sequence. These are also called crossover designs because the groups exchange treatments. Crossover designs have been used frequently in dental research studies using a placebo and an active compound. For example, Zinner et al20 tested the effects of Proxygel (Reed & Carnrick) and a placebo on the oral hygiene index (OHI). An abridged version of the data is given in Table 20-4. At the end of the first period, the mean OHI value for Group 1 (which received the Proxygel first) is designated YP1G1; the value for the same group at the end of the second period is YP2G1. For the group that received the placebo first (Group 2), the values are labeled YP1G2 and YP2G2, respectively. We can see that the Proxygel apparently caused a decrease in the OHI, because of the following:

1. At the end of the first period, the OHI value fell for the Proxygel group, and their score was less than the score for the placebo group: Y_P1G1 < Y_P1G2.
2. When, during period 2, the group that received the placebo first was given the Proxygel treatment, their OHI values declined: Y_P1G2 > Y_P2G2.

When the data from this experiment were treated as a simple bivalent random-groups design (ie, the groups were compared at the end of period 1 using the Student t test), a t value of 2.92 was obtained. However, the t value calculated using the differences between the periods for each group yielded the much higher value of 5.91. (You will recall that the higher the value of t, the less likely the result can be explained by chance.) Thus, the use of a crossover design produced a more sensitive test. It is clear that the design could be improved by taking readings before any treatments were started, because this would enable additional comparisons to be made. However, this level of complexity is beyond the scope of this book, and the interested reader is referred to Varma and Chilton's21 article or Jones and Kenward's22 book, both devoted to crossover designs, for more details.

The advantage of counterbalanced designs is that each subject serves as its own control. Because as a general rule there is less variation in a property of the same individual measured at different times than there is variation among different individuals at different times, it is easier to detect small differences. Another advantage of the design is that statistical analysis may show that effects of order are important.

Weaknesses of the counterbalanced design surface when there is an interaction between the groups and the variables being tested. In that event, the simple comparison of the means of the two tests would be valueless. With regard to this problem, Schor23 has suggested that some clinical investigators of new drugs seem to have contracted a contagious disease called crossoveritis. The symptoms of the disease include an investigator's belief that two drugs can be tested properly only by testing them on the same people (as occurs in a crossover design). As noted previously, if there is an interaction, or—in the case of drugs—if the effect of one treatment is so great that it does not disappear when the treatment is discontinued, then the results are not trustworthy. In crossover experiments, a key goal is to allow a sufficient period of time (called a washout period) between treatments, so that there are no residual effects from the first treatment. Sometimes this is difficult or impossible to arrange.
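The sensitivity gain described above comes from analyzing within-subject period differences rather than comparing the groups directly. The sketch below illustrates the point on synthetic data; the numbers are invented, washout and order effects are not modeled, and numpy and scipy are assumed to be available.

```python
# Why a crossover analysis can be more sensitive than a parallel-group
# comparison: analyzing within-subject differences removes the large
# between-subject component of variation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 12
subject_level = rng.normal(3.0, 0.8, n)   # large between-subject variation
treatment_effect = -0.5                   # active agent lowers the index

# Each subject is measured under both placebo and active treatment
placebo = subject_level + rng.normal(0, 0.2, n)
active = subject_level + treatment_effect + rng.normal(0, 0.2, n)

# Naive analysis: treat the two conditions as independent groups
t_ind, p_ind = stats.ttest_ind(active, placebo)

# Crossover-style analysis: each subject serves as its own control
t_rel, p_rel = stats.ttest_rel(active, placebo)

print(f"independent groups: t = {t_ind:.2f}, P = {p_ind:.3f}")
print(f"within subject:     t = {t_rel:.2f}, P = {p_rel:.3f}")
```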

Split-mouth design

The split-mouth design used in dentistry subdivides the mouth into experimental units of halves, quadrants, or sextants, and different treatments are applied to these experimental units. Because comparisons are made within the patient, variability is expected to be less than that in studies (with patients as the experimental unit) in which comparisons are made between patients. A potential disadvantage is that treatments performed in one part of the mouth can affect treatment in other parts of the mouth—a phenomenon that has been called the carry-across effect.24 If carry-across effects exist, then treatment effects cannot be measured directly because they include the sum of all carry-across effects.22 A study of the efficiency of split-mouth designs found that this design produces moderate to large gains in efficiency when disease characteristics are symmetrically distributed over the experimental units and a sufficient number of sites is available, but that in the absence of a symmetric disease distribution, whole-mouth clinical trials may be preferable.25

Other possible relationships among time, testing, and treatment

Counterbalanced designs are only one option for the arrangement of time, testing, and treatment. Campbell and Stanley26 developed a classification scheme that has become a standard for behavioral science and educational research. Appendix 5 lists these designs and their sources of invalidity in terms of the major relevant variables: history, maturation, testing, instrumentation, regression, selection, mortality, and their interactions.

In experiment design, investigators consider internal and external validity. Another way of thinking about internal validity (discussed in chapter 16) is that it refers to the degree to which the independent variable brings about change in the dependent variable. Extraneous factors that are not controlled decrease the internal validity because you cannot be sure that changes in the independent variable account for the results. When a study is said to not be internally valid, it means that plausible rival hypotheses could explain the results. Internal validity is of prime importance when the experiment is testing a hypothesis or establishing a mechanism. An experiment with poor internal validity will not add to our understanding of the processes under consideration. As noted earlier, external validity refers to the degree to which a study reflects events that occur in the real world. Threats to external validity include failure (or inability) to select a random sample, or interactions between a treatment and the particular subjects studied. External validity is of prime importance when an experiment's purpose is related to an application. An experiment with poor external validity will not allow us to predict what will happen when a treatment is applied in the real world.

Campbell and Stanley26 categorized experiment designs as pre-experimental, quasi-experimental, and true experimental. A brief summary of this classification system—some components of which were discussed earlier—follows.

Pre-experimental designs include the one-shot case study and the one-group pretest/posttest design. According to Campbell and Stanley, a design such as the one-shot case study has such an absence of control as to be of almost no scientific value.

Quasi-experimental designs include time series, equivalent time series, equivalent material samples, nonequivalent control group, separate sample pretest/posttest, separate sample pretest/posttest control group, multiple item series, institutional cycle, and regression discontinuity designs. The quasi-experimental designs resemble true experimental designs in that some manipulation is done to look for an effect on the dependent variable but differ from true experiments in that either a control or randomization is lacking. Campbell and Stanley27 deem these quasi-experimental designs worthy of use where better designs are not feasible (see below).

In true experimental designs, the investigator is able to manipulate a variable (for example, apply a treatment), randomize the groups (so that experimental and control groups are, in theory, similar), and control (or eliminate) interfering and irrelevant influences from the study. True experimental designs include the pretest/posttest control group, Solomon four-group, and posttest only control group designs, as well as other more complex elaborations such as factorial design.

Quasi-experimental design

In several circumstances it is not possible to use randomization or to manipulate the introduction or withholding of treatment. Two main types of quasi-experimental design are the nonequivalent control group and the previously described interrupted time series. The nonequivalent control design entails using groups in which the subjects have not been assigned randomly. Lack of randomization raises the possibility that there may be preexisting differences between the groups that could explain differences in outcome between the treated and control groups. In particular, there is the possibility of "confounding by indication." For example, in the evaluation of vaccine effectiveness, patients with poor prognoses are more likely to be immunized. Thus, selection for vaccination is confounded by patient factors that are likely related to clinical end points.28 Similarly, the time series designs are subject to temporal confounders and maturation, as well as regression to the mean.

Despite these drawbacks, necessity sometimes dictates the use of quasi-experimental designs, such as in the study of infection control and antibiotic resistance. For example, an outbreak of resistant organisms in a hospital might leave little time to implement all of the procedures used in an RCT, because it must be dealt with immediately. Harris et al29 have systematically reviewed reports using quasi-experimental designs in infection control and antibiotic resistance and have concluded that the conduct and presentation of such studies need to be improved so that interventions are evaluated more rigorously. For example, segmented regression analysis is a powerful tool for evaluating longitudinal effects of interventions that has considerable potential to improve conclusions from interrupted time series studies.30 Although quasi-experimental designs lack the formal rigor of true experimental designs, they still can be informative; the challenge is for the investigators to consider—and, ideally, rule out—alternative explanations of their results so that their interpretations are sound.
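As an illustration of the kind of model involved, the following sketch fits a segmented regression to synthetic interrupted-time-series data. The variable names and numbers are hypothetical, and pandas and statsmodels are assumed to be available.

```python
# Segmented regression for an interrupted time series: a level-change
# term and a slope-change term are added at the intervention point.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
months = np.arange(24)
after = (months >= 12).astype(int)  # hypothetical program starts at month 12
noise = rng.normal(0, 1.5, 24)
infections = 30 + 0.2 * months - 5 * after - 0.8 * after * (months - 12) + noise

df = pd.DataFrame({
    "time": months,                      # baseline trend
    "after": after,                      # level change at intervention
    "time_since": after * (months - 12), # trend change after intervention
    "rate": infections,
})
fit = smf.ols("rate ~ time + after + time_since", data=df).fit()
print(fit.params)  # estimates of baseline trend, level change, and trend change
```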

Some issues related to RCTs

The dominance of the RCT in epidemiologic thinking leads some to conclude that it is the only reliable way of seeking knowledge, but this is obviously not true. For example, Grossman and Mackenzie31 relate a case where an evidence-based working group discounted the contribution of 13 observational studies that all agreed, in favor of the single RCT with an opposing conclusion. Moreover, RCTs often use patients with less severe disease and can thus mislead one's understanding of a treatment's usefulness. RCTs also use patients who are relatively young, yet the results are extrapolated inappropriately to older patients; in fact, some generalizations are made to patients who have been excluded from the study.32

In essence, the problem is that, to increase internal validity and maximize the size of effect, investigators typically choose relatively homogeneous groups of patients, usually those who have a strong potential to benefit from the treatment. In my own studies on agents that reduce oral malodor, for example, my colleagues and I selected individuals who had considerable oral malodor. This strategy was necessary to increase our chances of seeing a significant effect. But it came at a price, because it was no longer clear what benefit those with less extreme mouth odor would receive. In other words, as is often the case, we increased internal validity at the expense of external validity.

One approach to this problem is to widen the spectrum of subjects included in the trial. However, if people who can benefit only minimally from a treatment are included, the average effect size will be reduced. Moreover, interpretations normally made from homogeneous groups where everyone received a benefit cannot be made from heterogeneous groups where the benefit may have depended on a patient's particular characteristics. Heterogeneity of treatment effects is the term used to describe the differential response to the same treatment by different patients with different characteristics (such as severity of disease, age, sex, genetic makeup, and so forth). In the chlorhexidine-soaked dental floss trial, my graduate student Pauline Imai and I found that the treatment worked better for sites in which the pocket depth was less than 4 mm. The probable reason was that floss likely goes down into the pockets only about 2 to 3 mm, so the treatment was not being delivered to the bottom of the deeper pockets. The consequence was that our interpretation had to be modified to restrict the range of sites where this modality might be effective. One can imagine extreme instances where a treatment is superbly effective for some patients and ineffective, or even damaging, for others, yet the average effect would work out to be positive. Such a situation could result in a treatment being recommended, on the basis of average effects, to patients who would not benefit or who would even suffer from it.

Other challenges include problems of blinding and randomization, availability of a suitable placebo, inaccurate statistical analysis, inappropriate surrogate outcome measures, and false negatives. Although the RCT is a powerful design, its interpretation and clinical application are not necessarily straightforward. In a review of 39 RCTs published in major general clinical journals, Ioannidis33 found that 9 of the 39 studies had been contradicted or shown to have initially stronger effects than subsequent studies demonstrated. A significant problem has been the tendency to view one well-controlled trial that achieved statistical significance as definitive. As Fisher emphasized, the test of a firm conclusion is repeated studies, each of which demonstrates statistical significance.

References

1. Salsburg D. The Lady Is Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. New York: Henry Holt, 2001.
2. Finney DJ. Experimental Design and Its Statistical Basis. Chicago: University of Chicago, 1955.
3. Fisher RA. The Design of Experiments, ed 8. New York: Hafner, 1965.
4. Fleiss JL. The Design and Analysis of Clinical Experiments. New York: Wiley, 1986.
5. Chilton NW. Design and Analysis in Dental and Oral Research. New York: Praeger, 1982.
6. Mainland D. Elementary Medical Statistics. Philadelphia: Saunders, 1963.
7. Vlietstra JR, Sidaway DA, Plant CG. Cavity cleansers. A simple in vitro test. Br Dent J 1980;149:293–294.
8. Conant JB. On Understanding Science. New York: Mentor, 1951.
9. Chalmers TC, Celano P, Sacks HS, Smith H Jr. Bias in treatment assignment in controlled clinical trials. N Engl J Med 1983;309:1358–1361.
10. Blair E. Gold is not always good enough: The shortcomings of randomization when evaluating interventions in small heterogeneous samples. J Clin Epidemiol 2004;57:1219–1222.
11. Gilbert N. Biometrical Interpretation. Oxford: Clarendon, 1973.
12. Mainland D. Elementary Medical Statistics. Philadelphia: Saunders, 1963.
13. Beveridge WIB. The Art of Scientific Investigation. New York: Vintage, 1950.
14. van Belle G. Statistical Rules of Thumb. New York: Wiley, 2002.
15. Cited in Konig KG. Design of animal experiments in caries research. In: Harris RS, Caldwell RC (eds). Art and Science of Dental Caries Research. New York: Academic, 1968.
16. Alfano MC. Controversies, perspectives, and clinical implications of nutrition in periodontal disease. Dent Clin North Am 1976;20:519–548.
17. Fisher RA. The arrangement of field experiments. J Minist Agric 1926;33:503.
18. Finney DJ. Experimental Design and Its Statistical Basis. Chicago: University of Chicago, 1955.
19. Shaw DH, Wischmeier C. Combined ulcerogenicity of aspirin and indomethacin in the rat. J Dent Res 1976;55:1133.
20. Zinner DD, Duany LF, Chilton NW. Controlled study of the clinical effectiveness of a new oxygen gel on plaque oral debris and gingival inflammation. Pharmacol Ther Dent 1970;1:7–15.
21. Varma AO, Chilton NW. Crossover designs involving two treatments. J Periodontal Res 1974;9(suppl 14):160–170.
22. Jones B, Kenward MG. Design and Analysis of Cross-over Trials. London: Chapman & Hall, 1989.
23. Schor SS. How to evaluate medical research reports. Hosp Physician 1969;5:95.
24. Hujoel PP, DeRouen TA. Validity issues in split-mouth trials. J Clin Periodontol 1992;19:625–627.
25. Hujoel PP, Loesche WJ. Efficiency of split-mouth designs. J Clin Periodontol 1990;17:722–728.
26. Campbell DT, Stanley JC. Experimental and Quasi-experimental Designs for Research. Chicago: Rand McNally, 1966:6.
27. Campbell DT, Stanley JC. Experimental and Quasi-experimental Designs for Research. Chicago: Rand McNally, 1966:34.
28. Hak E, Verheij TJ, Grobbee DE, Nichol KL, Hoes AW. Confounding by indication in non-experimental evaluation of vaccine effectiveness: The example of prevention of influenza complications. J Epidemiol Commun Health 2002;56:951–955.
29. Harris AD, Lautenbach E, Perencevich E. A systematic review of quasi-experimental study designs in the fields of infection control and antibiotic resistance. Clin Infect Dis 2005;41:77–82.
30. Wagner AK, Soumerai SB, Zhang F, Ross-Degnan D. Segmented regression analysis of interrupted time series studies in medication use research. J Clin Pharm Ther 2002;27:299–309.
31. Grossman J, Mackenzie FJ. The randomized controlled trial: Gold standard, or merely standard? Perspect Biol Med 2005;48:516–534.
32. Greenfield S, Kravitz R, Duan N, Kaplan SH. Heterogeneity of treatment effects: Implications for guidelines, payment, and quality assessment. Am J Med 2007;120(4 suppl 1):S3–S9.
33. Ioannidis JP. Contradicted and initially stronger effects in highly cited clinical research. JAMA 2005;294:218–228.


21 Statistics As an Inductive Argument and Other Statistical Concepts

Statistics ≠ Scientific Method

Scientific method does not necessarily require the use of statistics, as is evidenced by some of the classic papers in molecular biology and biochemistry. The power of their experiment systems is so great that differences between the items being compared are evident enough not to require statistical tests. These differences may be observed directly (eg, the appearance or disappearance of a particular molecule that can be identified by its location on a polyacrylamide gel). This disclaimer notwithstanding, it should be noted that most clinical dental research does benefit from a statistical approach, because the size of the effects being observed is often small and it is difficult to distinguish the effects from natural variation. The number of articles in the dental literature that employ statistics has risen rapidly, and instruction in statistics is now common—though not necessarily effective2—in dental schools.

The statistics used in most research papers are different from the statistics of everyday usage, such as reported census data or baseball batting averages, which provide comprehensive information about the population involved. For example, in calculating a batting average, we know every time a player went to bat, and we know the player's exact number of hits, so the batting average completely and accurately represents the player's performance.



"A judicious man looks at statistics not to get knowledge, but to save himself from having ignorance foisted upon him."
THOMAS CARLYLE1


Statistical Inference Considered As an Inductive Argument

Prior prediction

To be testable by standard statistical procedures, a hypothesis must predict some particular distribution of a measured value. The hypothesis is best made in advance of the experiment. This point can be illustrated by looking at a table of random numbers. The frequencies of certain numerals will be higher in some sequences than would be predicted by chance alone. A statistical test would judge the departures from predicted values to be significant. Could we then say that the tables of random numbers are not random? We could not, because such a test—constructed after the data had been collected—would not be valid. Any sample of random numbers can be expected to exhibit some unusual sequences, but if they are truly random numbers, we cannot predict what those sequences will be.

As noted in our discussion of evaluation of hypotheses (chapter 8), successful prediction is considered stronger support for a hypothesis than an equal amount of data known when the hypothesis was put forward. The same arguments hold true for statistical analysis of research data. If a set of data shows some effect that was not suspected prior to the experiment, the conservative strategy is to test for the effect in a subsequent experiment. An alternative is to use special statistical tests that have been developed for after-the-fact (a posteriori) analysis. In the example of hypothesis testing in chapter 8, prior prediction was assumed.

Golden rule of sample statistics

A number of features of statistical reasoning about samples were not illustrated in the previous chapters. Some of these may be summed up in what I will call the golden rule: Any sample should be evaluated with respect to size, diversity, and randomness.3 The statistics in research reports are based on samples. The obvious questions to ask are the following: (1) How representative of the target population are the samples? Was the sample selected randomly, or were there factors present that could lead to biased selection? (2) How large is the sample? (3) How much diversity exists in the sample? Does it cover all of the groups that make up the population? Fallacies, as well as acceptable arguments, can arise from statistical inference.

Induction by enumeration

Induction by enumeration is commonly used in scientific thinking. In this form of reasoning, a conclusion about all of the members of a class is drawn from premises referring to observed members of that class.

Premise: Four of the 20 patients experienced a problem after endodontic treatment.
Conclusion: Twenty percent of all patients experience problems after endodontic treatment (inductive generalization).

Induction by enumeration, also called statistical generalization, can yield false conclusions from true premises. For example, the conclusion above carries no estimate of its error. It is unlikely that a second sampling of 20 patients would also yield a value of exactly 20% failures. We could try to reduce the chances of being accused of a false conclusion by hedging the conclusion. For example, we could write "about 20% experience problems." Hedging does not solve the problem. A more precise solution would be to use probability theory to calculate your confidence in the statement. Because this example is based on the binomial theorem with two mutually exclusive outcomes (success or failure) with a fixed probability, we could use the table in appendix 2. Using the example of University of British Columbia (UBC) dental graduates in chapter 10, consider Dentist B, who had 16 successes and 4 failures. According to the table in appendix 2, this result indicates that Dentist B can be 95% confident that his percentage of failures lies between 5.8% and 44%. We can see that precisely stating the confidence interval gives a rather different perspective on the data. There is no reason for Dentist B to be overconfident; a failure rate as high as 44% is included in this interval.
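The same confidence interval can be computed directly rather than read from a table. A minimal sketch, assuming scipy version 1.7 or later is available; the exact (Clopper-Pearson) interval it returns agrees with the appendix 2 values up to rounding:

```python
# Exact 95% confidence interval for 4 failures observed in 20 cases.
from scipy.stats import binomtest

result = binomtest(k=4, n=20)                     # 4 failures out of 20
ci = result.proportion_ci(confidence_level=0.95)  # exact method by default
print(f"95% CI for the failure rate: {ci.low:.1%} to {ci.high:.1%}")
# roughly 5.7% to 43.7%, in line with the 5.8%-44% read from appendix 2
```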

Fallacy of insufficient statistics (or hasty generalization)

In the fallacy of insufficient statistics, an inductive generalization is made based on a small sample.

Premise: My mother and my wife smoke.
Conclusion: All women smoke (fallacy of insufficient statistics).


If this argument were used as sole evidence for the conclusion that all women smoke, the argument would be classified as proof by selected instances. Associated with this fallacy is the problem of how many is enough. The answer depends on how large, how varied, and how much at risk is the population being studied. Numbers that appear quite large can still be inadequate. Huff3 recounts a test of a polio vaccine in which 450 children in the community were vaccinated, while 680 were left unvaccinated as controls. Although the community was the locale of a polio epidemic, neither the vaccinated children nor the controls contracted the disease. Paralytic polio had such a low incidence—even in the late 1940s—that only two cases would have been expected from the sample size used. Thus, the test was doomed from the start.

What is a reasonable sample size?

There is no simple answer to this question, but two general considerations appear to be relevant: tradition and statistics.

Tradition

In some research areas, experience shows that consistent and reliable results require a certain number of subjects or patients; for example, Beecher4 recommends at least 25 patients for studies on pain. These traditions probably develop because of practical considerations such as patient availability, as well as inherent statistical considerations such as subject variability. The underlying philosophy in this approach is that the replication of results is a more convincing sign of reliability than a low P value obtained in a single experiment. If a number of investigators find the same result, we can have reasonable confidence in the findings. Means of combining and reviewing studies are given in Summing Up by Light and Pillemer.5

There are problems with the traditional method of choosing an arbitrary and convenient number of subjects. If the sample is too large, the investigator wastes effort, and the subjects are unnecessarily exposed to treatments that may not be optimal. However, most often the number of subjects is too small; the size of the sample is usually limited by subject availability. Thus, studies that use low numbers of subjects and report no treatment effect should be considered with skepticism. In fact, some journals have a policy of rejecting papers that accept null hypotheses. Such negative-results papers end up in a file drawer. They neither contribute to scientific knowledge nor enhance the careers of their authors. These "file-drawer papers" emphasize the need to understand the relationship between sample size and the establishment of statistically significant differences. Failure to use an appropriate sample size leads to much wasted effort.

Role of the sample size and variance in establishing statistically significant differences between means in the t test

The size of an adequate sample is a complex question that will not be addressed in great detail here. However, some insight into the problem can be gained by examining the formula for the t test, which is perhaps the most common statistical test used in biologic research. It should be used when comparing the means of only two groups. The t test exists in several forms that depend on whether the samples are related (eg, the paired t test) or independent. Below is the formula used for a two-sided comparison of the means of measured values from two independent samples (unpaired subjects) with roughly equal variances:

$$ t = \frac{\text{difference in means}}{\text{SE of the difference between means}} = \frac{\bar{X}_A - \bar{X}_B}{s_{\bar{X}_A - \bar{X}_B}} $$

$$ s_{\bar{X}_A - \bar{X}_B} = \sqrt{\frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2}} \times \sqrt{\frac{1}{n_A} + \frac{1}{n_B}} $$

where n_A, n_B = numbers in samples A and B, respectively; X̄_A, X̄_B = means of samples A and B, respectively; s_A², s_B² = variances of groups A and B, respectively; s_{X̄A–X̄B} = standard error (SE) of the difference between means; and n_A + n_B – 2 = degrees of freedom.

A high observed t value indicates that it is unlikely that the two samples were drawn from a population with the same mean. Having selected a significance level and calculated a quantity called degrees of freedom (df) (in this case df = n_A + n_B – 2), we can consult a table of published t values such as can be found in appendix 3. If the observed t value is higher than the critical level, we reject the null hypothesis. For example, consider once again the study by Keene et al,6 who reported the decayed, missing, or filled teeth (DMFT) data found in Table 21-1:


Table 21-1 | Relationship of Streptococcus mutans biotype to DMFT

Biotype                     Mean    SE     SD     n
e-type carriers [+]         4.91    .48    3.59   56
Non-e-type carriers [–]     3.51    .28    3.31   140

SD, standard deviation. (Data from Keene et al.6)

$$ t = \frac{\bar{X}_e - \bar{X}_{\text{non-}e}}{\sqrt{\dfrac{55(3.59^2) + 139(3.31^2)}{194}} \times \sqrt{\dfrac{1}{56} + \dfrac{1}{140}}} = \frac{1.40}{0.54} = 2.59 $$

variability in the subjects, this small sample often means that no difference is detected, even when it is likely that a real difference exists.

Effect-size approach df = 140 + 56 – 2 = 194.0 The critical value of t for 194 df at an α level of .05 = 1.96. Since the calculated value of t = 2.59 > 1.96, we can reject the null hypothesis and conclude that there is a significant difference in DMFT between the e-type and non-e-type carriers. From the formula for the statistic, we realize the following: 1. A large difference between means will increase the t value. This makes sense. Large effects should be easier to see. If the effect is large, the sample size required to distinguish the groups will be smaller. 2. The t value is decreased when the variance is large. (Remember that variance is related to the spread of measured values around the mean.) A large variance tends to produce a small t value, making it harder to detect differences between groups. 3. The t value is increased when the number in the sample is large. From the formula, it can be calculated that the decrease in the SE value is proportional to the square root of the sample number. Thus, if you increase the sample size by a factor of 10, you only decrease the SE by 3.16. This may be an inefficient way of reducing the error, but it works. Bakan7 notes that increasing the sample size (n) almost inevitably guarantees a significant result. Even very small alterations in experimental conditions might produce minute differences that will be detected by a statistical test made very sensitive by a large sample size. In the dental sciences, the reverse is the more common problem; that is, frequently there is only a small number in the sample. Combined with a large

Choosing the optimal sample size is a complex business that depends on the difference between groups, the variability of the groups, the experiment design, and the confidence level. A rough-and-ready way to estimate an appropriate sample size using the effect-size approach8 for a simple experiment with a treated and a control group follows:

1. Estimate what you think is an important (or important to detect) difference in means between the treated and control groups.
2. Use the results of a pilot study or previously published information on the measurement to estimate the expected pooled standard deviation (SD) of a reasonably sized sample.
3. Using the values found in (1) and (2) above, calculate the d statistic (formula given on page 308).
4. Choose your level of confidence (5% or 1%).
5. Look up the number of subjects needed in appendix 6.

Using the same approach, we can also work backward to decide if a paper's authors used a reasonable sample size. This is particularly important if no effect was seen, because the sample may have been too small to produce a sensitive experiment. To check this possibility, do the following:

1. Use the authors' data to estimate the d statistic.
2. Look up the table in appendix 6 to determine the number of subjects that should have been used.
3. Compare the number from the table with the actual number used.

Effect- and sample-size tables for more sophisticated designs can be found in Cohen's8 book Statistical Power Analysis for the Behavioral Sciences.
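A software-based version of this calculation is sketched below. It uses a power routine rather than the appendix 6 table, and the chosen power of 80% is an assumption added for the illustration; statsmodels is assumed to be available.

```python
# Solve for the per-group sample size of a two-group comparison,
# given a hypothesized effect size d, an alpha level, and a power.
from statsmodels.stats.power import TTestIndPower

d = 0.5  # hypothesized effect size (steps 1-3 above combined)
n_per_group = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.8)
print(f"about {n_per_group:.0f} subjects per group")  # about 64 for d = 0.5
```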


Dao et al9 examined the choice of measures in myofascial pain studies. Based on the characteristics of the measurements, they calculated how many subjects would be required to detect significant differences of specified sizes between groups. Their values for within- and between-subject variance were obtained from a population referred to as "University Research Clinic." The technique used to measure pain was a visual analog scale (VAS), shown to be a rapid, easy, and valid method that provides a more sensitive and accurate representation of pain intensity than descriptive scales. They found that detecting a 15% difference in pain intensity between treatment and control groups in an experiment with three groups would require 242 subjects per group. However, to detect an 80% difference in pain intensity, only eight subjects per group would be needed, and the total study size would be 24 subjects. Even with the sensitive VAS method of measuring pain, the traditional approach of using 25 subjects and descriptive scales could probably distinguish only very large differences between groups (ie, ≈ 80% differences, if three groups were used).

A handy rule for the effect of sample size on confidence intervals

Van Belle's Statistical Rules of Thumb10 includes a chapter on calculating sample sizes and provides a number of other "rules of thumb" useful in performing practical statistical analysis. Figure 21-1, taken from van Belle,10 shows how the half-width of a confidence interval varies with sample size. One can see that the width of the confidence interval decreases rapidly until 12 observations are reached and then decreases more slowly.
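The rule can be verified with a few lines of code. A minimal sketch, assuming scipy is available:

```python
# Half-width of a confidence interval for a mean, in units of the
# sample SD: t(conf, n-1) / sqrt(n). It drops steeply up to about n = 12,
# reproducing the pattern in Fig 21-1.
from scipy import stats

for n in (3, 5, 8, 12, 20, 30):
    hw = {conf: stats.t.ppf(1 - (1 - conf) / 2, df=n - 1) / n**0.5
          for conf in (0.90, 0.95, 0.99)}
    print(n, {k: round(v, 2) for k, v in hw.items()})
```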

Fig 21-1 Half-width of the confidence interval assuming a t statistic with n – 1 df and sample size n, plotted against sample size. Curves are shown for confidence levels of 90%, 95%, and 99%. (Reprinted from van Belle10 with permission.)

The Fallacy of Biased Statistics

To review briefly, descriptive statistics are simply efficient ways of describing populations. For our purposes, a population is a collection of all objects or events of a certain kind that—at least theoretically—could be observed. The specific group of objects (events, subjects) observed is the sample. Inferential statistics are concerned with the use of samples to make estimates and inferences about the larger population. This larger population from which the sample is selected is also called the parent population or, perhaps more accurately, the target population. This is the population that we hope the sample represents, so that we can generalize our findings.

The fallacy of biased statistics occurs when an inductive generalization is based on a sample that is known to be—or is strongly suspected to be—nonrepresentative of the parent population. This problem of nonrepresentative samples is associated with the randomness of sample selection and the spread of the sample. Gathering numbers to obtain information may be likened to an archer shooting at a target. The bull's-eye represents the true value of the population in question. The place where the arrow hits represents one value from the sample. Bias is the consistent repeated divergence of the shots from the bull's-eye. For an archer, this bias may be caused by a factor such as a wind blowing from one direction that causes the arrows to hit predominantly on one side of the target. In research experiments, bias may be caused by factors leading to nonrandom sampling.

There are many examples of biased polls giving erroneous results. Perhaps the most famous is the Literary Digest poll that predicted, on the basis of a telephone survey and a survey of its subscribers, that a Republican victory was assured in the 1936 election.3 The number of people polled was huge—over two million—but the prediction was wrong because the people polled were not representative of the voting public. In fact, they were wealthier than average because, at that time, telephones were found mainly in the homes of the wealthy. Thus, the Literary Digest poll selected relatively wealthy people, who stated that they would vote Republican. Today, the public opinion polls conducted by the Gallup and Harris organizations interview only about 1,800 persons weekly to estimate the opinions of United States residents age 18 and over, but these polls are reasonably accurate because the sample is chosen by a stratified random sampling method that is not biased.

Bias can appear in many different ways, as denoted by Murphy's11 definition: Bias is any trend in the choice of a sample, the making of measurements on it, the analysis or publication of findings, that tends to give or communicate an answer that differs systematically (nonrandomly) from the true answer. Much of epidemiology is concerned with searching out various kinds of bias. An entertaining guide to this topic is the Biomedical Bestiary by Michael et al.12 The topic also is discussed comprehensively by Sackett,13 who cataloged no fewer than 56 types of bias. Only a brief overview is presented here.

Who is studied (selection bias)

Persons being studied may differ in important ways from the larger population they are supposed to represent; this is known as selection bias. Selection bias was exemplified in the case of the Literary Digest poll. Persons who seek medical attention for a disease are likely to be those who are the sickest—sometimes leading physicians to conclude that the disease is more debilitating than it really is. Selection bias in clinical studies can arise in different ways; the following are two types that particularly affect dental research.

1. Volunteer bias can occur if a significant number of eligible subjects do not agree to participate. Participants tend to be more motivated, more compliant, and, in clinical studies, destined for better outcomes than those who decline to participate. Compliant volunteers likely practice oral hygiene more rigorously than the general population; treatments that require strict plaque control would be expected to be more effective on volunteers. Dental research often uses dental students as volunteers, but this group obviously differs from the general population in their attitude about oral hygiene.

2. Withdrawal bias. Patients who are withdrawn from a study or refuse treatment may differ systematically from those who remain or accept treatment. This bias probably explains the results of a review of a periodontal practice that found that those who elected surgical treatment had better periodontal health than those who refused treatment and left the practice. In a trial of a beta-adrenergic blocking agent on patients recovering from heart attack, 13.9% of patients in the placebo (control) group died suddenly, whereas only 7.7% of the drug-treated patients suffered the same fate.14 The data were statistically significant, and it appeared that the drug produced a potentially clinically important reduction. However, because of the side effects of the drug (such as bradycardia and hypotension), drug-treated subjects withdrew from the study more frequently than the placebo-treated subjects. It is possible that patients who are sensitive to drug side effects are also susceptible to arrhythmias and sudden death. The observed selective withdrawal raises the possibility that the difference in outcomes may be caused by the high-risk people removing themselves from the treated group, and that the drug itself has no effect.

Therefore, at least three alternatives could explain differences found between treated and control groups: (1) chance fluctuations—but these can be shown to be unlikely by statistical techniques; (2) a real effect—ie, the drug does prevent sudden deaths; or (3) the withdrawal of high-risk patients from the treated group. Because a plausible alternative hypothesis exists, a critic would be on solid ground in refusing to believe that the drug had an effect. The investigators would have to defend their hypothesis from such a criticism by gathering more data, referring to other data, or—as was done in this study—analyzing the data in a different manner so that the criticism could be met.

How the sample is studied

Detection biases

For various reasons, such as knowledge of exposure to a putative cause (diagnostic suspicion bias), some patients (cases) may receive more intensive or prolonged examination than the controls. Cases may be examined many times on each recall visit, whereas the controls are examined only once (recall bias).


Observational error biases

Certain measures may differ systematically from their usual level because of the procedures or conditions used to make the measurement—eg, blood pressure during medical interviews. Measurements that hurt, embarrass, or invade privacy may be systematically refused or avoided. Instruments or techniques may be inaccurate or insensitive.

Response bias

This systematic error results when subjects respond inaccurately to an investigator's questions. For instance, dental floss sales do not correspond to the number of people who report using dental floss in surveys. Presumably, the subjects lie about their flossing practices to give a more positive impression of their personal hygiene to interviewers. A subset of the response bias is the obsequience bias, in which subjects systematically alter their responses in the direction they perceive to be desired by the investigator. For this reason, reviews of teaching effectiveness are best done with the students remaining anonymous, and that is the normal practice. Some years ago, I learned that a colleague circumvented the system and managed to collect comments individually from the students. The reviews were very good.

Sampling and treatment allocation bias

Bias can also occur in sampling or allocation of treatments within an individual. For example, in a study on the effect of an antibiotic on periodontal disease, the authors proceeded as follows: (1) microbiologic samples were obtained from the most severely periodontally involved site and were therefore not representative of all sites; and (2) the most severely involved quadrant was treated with root planing, while another quadrant served as the control. Thus, there was a systematic difference between the control and treated quadrants.

Other Statistical Concepts

When the null hypothesis is rejected

There are at least five possible explanations for the deviation of a sample from the real value or the value proposed by a hypothesis.

Sometimes there are more than two alternatives

Statistical methods are useful when random errors are the primary source of variation, but there are other types of error. As outlined previously, when testing a hypothesis, it is assumed that there are two mutually exclusive alternatives. One is that the two samples differ only as a result of chance fluctuation. When this alternative is rejected, it is surmised that the reason the samples differ is the reason put forward by the investigator (ie, that the difference is caused by the difference in, for example, treatments between the treated and control groups). It is assumed that all other relevant variables have been held constant. This is not always true. There may be more than one difference between the groups. This is a variant of the UFO fallacy; the statistical test assumes there are only two possibilities that are mutually exclusive and exhaustive. In reality, there may be other possibilities.

Sample versus target populations

In evaluating research reports, one must always bear in mind that there are two distinct populations: the target population, about which the investigator wishes to draw a conclusion, and the sampled population, from which the sample was actually taken. The possibility of bias is such an important consideration that it bears repetition. When clinicians report the results of various therapeutic procedures, a question that immediately comes to mind is: How representative of the general population (or, even more cynically, of all of the patients treated by the clinician) are these particular patients? Selective factors may operate to cause the two populations being compared to be different.

Unreliability of observation

Some techniques do not measure what they claim. Other techniques are inherently unreliable. Some investigators are habitually incompetent. All of these factors can produce unreliable results. For this reason, if you are suspicious of or unfamiliar with a particular technique on which a scientific paper depends, it is worthwhile to talk with someone who has used it. It might also be helpful to conduct a citation analysis of the publication in which the technique was originally described.


Sampling fluctuations (type I error)

This is discussed in chapter 10.

Real difference between the populations and hypothesis

In evaluating this possibility, two questions arise: the importance of the finding and the size of the effect. The importance of the findings is best assessed by an expert or by citation analysis. The size of the effect can be calculated using several indicators; one is the d statistic. When the null hypothesis is rejected, it is often of interest to calculate the effect size. Cohen8 has devised methods for evaluating the size of an experimental effect for a number of different experiment designs. The effect-size index d for a comparison between means of two groups is calculated as follows:

$$ d_s = \frac{\bar{X}_a - \bar{X}_b}{s_p}, \qquad s_p = \sqrt{\frac{(n_a - 1)s_a^2 + (n_b - 1)s_b^2}{n_a + n_b - 2}} $$

where d_s = effect-size index for means in standard units; n_a, n_b = numbers in the samples of groups A and B; s_a, s_b = SDs of the samples of groups A and B; and X̄_a, X̄_b = means for groups A and B. Note that the d statistic is related to the t statistic simply by the following formula:

$$ d = \sqrt{\frac{n_a + n_b}{n_a \times n_b}} \times t $$

A convenient way of thinking of these values is in terms of population overlap (see appendix 7), which is given here as the percentage of the area covered by both populations that is not overlapped. When:

d = 0: 0% not overlapped; one population is perfectly superimposed on the other
d = 0.1: 7.7% not overlapped
d = 0.2: 14.7%; eg, height difference between 15- and 16-year-old girls
d = 0.5: 33%; eg, height difference between 14- and 18-year-old girls
d = 0.8: 47.4%; eg, height difference between 13- and 18-year-old girls, or mean IQ difference between PhDs and typical college freshmen
d = 1.0: 58.9%
d = 2.0: 81.1%
d = 3.0: 92.8%
d = 4.0: 97.7%
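A minimal sketch of the d calculation, applied to the DMFT summary statistics quoted earlier; only the Python standard library is needed:

```python
# Cohen's d from two groups' summary statistics (means 4.91 vs 3.51,
# SDs 3.59 and 3.31, n = 56 and 140). Values match the text's worked
# t test up to rounding.
import math

def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)

d = cohens_d(4.91, 3.59, 56, 3.51, 3.31, 140)
t = d / math.sqrt((56 + 140) / (56 * 140))  # the d-t relation, rearranged
print(f"d = {d:.2f}, implied t = {t:.2f}")   # d is about 0.41, t about 2.6
```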

Cohen8 has calculated effect-size indicators for other statistical tests. Finally, truly large differences do not require statistical tests. Most people will believe that a difference in size exists between apples and pumpkins without the calculation of the d statistic. Indeed, it has been argued that any effect so small that it requires statistics for its demonstration is not important. A quote attributed to Lord Rutherford states, “If your experiment needs statistics, you ought to have done a better experiment.” There is an element of truth in such an argument—particularly in laboratory studies in which investigators can manipulate several factors to demonstrate an effect or reduce the variability in the groups. Nevertheless, most dental research requires appropriate statistical analysis. What is regarded as an acceptable size of an experimental effect differs according to field. In behavioral science, Cohen8 has defined d values of 0.8 as being large differences. Such a difference would probably not be regarded as large in biologic studies. Nevertheless, small effects can be important, particularly in situations where large numbers of people or large amounts of money are involved.

When the null hypothesis is accepted

As noted earlier, it is relatively rare to see H0 accepted; such a negative result may mean only that the techniques used to demonstrate the effect were not sensitive enough. One should note that the test does not say H0 is true. It merely says there is no significant difference; expressed alternatively, either H0 is true, or there is not enough information to prove it false. In an assessment of 71 negative randomized controlled trials (RCTs), 50 of the trials used a sample size so small that the trials could have missed a 50% improvement. Many of the therapies labeled as no different from control had not received a fair test.15 A quality assessment of RCTs in dental research found that none of the 17 studies reporting no significant differences calculated a type II error to ensure that a sufficient sample size had been used.16


In a study of psychologic journals, Sterling17 found that investigators generally do not publish experiments that do not show significant differences at the 5% level or better. This is not surprising, because negative results—ie, reporting that the experimental conditions had no effect—could always be interpreted as meaning that the methods used were not sensitive enough to detect any effect. Such studies do not often end up in good journals. (In Sterling's report, about 3% of the articles did not reject the null hypothesis, while 97% did at the 5% level.) Although I have no precise data on the dental literature, my impression is that acceptance of the null hypothesis is not so rare.

Nevertheless, suppose that an idea for a given treatment occurred to 40 investigators. Next, suppose that the treatment, in reality, did not have any effect. It is likely that 2 of the 40 would get significant results at the 5% level. The first would publish the result, and the second would confirm it. A nonexistent effect could become enshrined in the literature without being refuted because of the other 38 authors' reluctance to publish negative results.7 The 38 investigators who do not publish their results exemplify what has been called the file-drawer problem. Studies that do not show a statistically significant effect are not published, but rather are placed in a file drawer. In trying to interpret the generality of results—for example, of the effectiveness of some form of therapy—reviewers have to guess at how many file-drawer manuscripts exist.

One method is the funnel plot,5 a scatterplot that is widely used in meta-analysis and systematic reviews. In this plot, each study on the question of interest is plotted on a graph where the x-axis is the reported size of the effect of a study. The y-axis is the precision of the study by such measures as standard error or the number of subjects. If no bias was operative, one would expect the studies of high precision to cluster around the average, whereas the less precise study results would be scattered widely but symmetrically around the average effect. The distribution would roughly resemble an inverted funnel centered on the average effect. An asymmetric distribution, such as when there are fewer lower-effect-size points, suggests the possibility of publication bias or a systematic difference of some kind between the low- and high-precision studies. (A minimal simulation of such a plot is sketched at the end of this section.)

In summary, H0 (ie, the no-difference hypothesis) can never be proven, because subsequent experiments with more replicates or more accurate measurements might show that a treatment does have an effect and so disprove the hypothesis. This reasoning leads to the statistician's common caveat against interpreting a lack of statistical significance as counterevidence. This "weakness" in statistical reasoning—ie, that statistics can disprove H0 (by showing that it is highly improbable that H0 is true) but cannot prove H0—has been discussed in detail by Bakan.7

Sometimes it is to the investigator's advantage to accept the null hypothesis for at least a part of a study. For example, if an experiment design requires two or more clinicians to make observations, it is common to compare them to determine if they differ significantly in their assessments. In such cases, it simplifies the analysis and interpretation of the data if it can be assumed that the clinicians do not differ significantly. In such papers, authors are glad to accept the null hypothesis.
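The funnel plot mentioned above can be simulated in a few lines. The sketch below uses invented studies with a common true effect and no publication bias, so the points form the expected symmetric funnel; numpy and matplotlib are assumed to be available.

```python
# Hypothetical funnel plot: effect size on the x-axis, precision (1/SE)
# on the y-axis. With no publication bias the points form a symmetric
# inverted funnel around the pooled effect.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
true_effect = 0.3
se = rng.uniform(0.05, 0.5, 60)        # standard errors of 60 invented studies
effects = rng.normal(true_effect, se)  # simulated study estimates

plt.scatter(effects, 1 / se)
plt.axvline(true_effect, linestyle="--")
plt.xlabel("Reported effect size")
plt.ylabel("Precision (1/SE)")
plt.show()
```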

Choice of computational unit

Experimental units can be considered at various levels, such as sites, half mouths, or patients. Blomqvist18 has recommended that the highest-level unit should be used because, in statistical inference, use of a lower-level unit will underestimate the SE and thus overstate the level of significance. This was evident to Mainland,19 who used a dental example in his classic text as follows:

Let us suppose that a dentist, having attended to the teeth of two boys, instructs one of them to use Toothpaste A and the other to use Toothpaste B, and ensures by parental cooperation that his instructions are carried out. After a certain length of time the dentist finds that the boy who used Toothpaste A has eight carious teeth, whereas the other boy has no caries. In terms of numbers of teeth, this looks like an impressive difference, but we need no profound knowledge of dentistry or of statistics to realize that it provides no adequate evidence that the difference in toothpaste was responsible. Persons differ in their tendency to develop caries; therefore the individual teeth in any mouth do not provide independent pieces of information about the effect of a toothpaste. It is boys, not teeth, that are the sampling units, and there is only one sampling unit in each of the A and B samples and no true replicates, that is, sampling units that receive the same test treatment but are otherwise independent.

In this simple case, the point is obvious, but it was not obvious to a distinguished worker in nutrition and dentistry who reported on the caries in 36,196 teeth in the mouths of 1,870 children. By examining about 20 teeth per child, the investigator has measured over and over again the same tendency (or resistance) to caries, but in the analysis each tooth was counted as if it gave an independent piece of information. The error—the error of wrong sampling units—can also be called spurious enlargement of samples, or spurious replication, or counting the same thing over again. In the large dental caries study, the proper sampling units were children, and one way to express the information would be by the numbers of children with, and without, caries (and this was done in another part of the report). A finer measure would be the number of carious teeth per child, with some form of adjustment for the number of filled and missing teeth.

This issue of sampling units has caused debate in dental science. At the 1983 Gordon Conference on periodontal disease, some clinicians argued that because periodontal disease was a localized lesion, each site could be treated as though it were independent. The statisticians echoed the arguments made by Mainland—namely, that the sites were not independent and that the statistical methods should be modified. The debate continues, but the statisticians are winning. Fleiss and Kingman20 state, “Theory and data both indicate that it is a mistake to employ statistical procedures that take individual sites as the unit of analysis; it is the patient who must be the unit of analysis.” Sophisticated methods have now been developed for determining and selecting the optimum number of sites and patients for clinical studies in dentistry.21 A related problem that is a frequent source of error is the choice of a unit that solves a problem related to, but different from, the one under investigation. Schor22 points out that in contraception research, a reduction in sperm count does not necessarily mean a reduction in fertility. If you were interested in the effects of a drug on fertility, the appropriate unit would not be the number of sperm in a sample of semen, but the percentage of men whose sperm count was reduced to subfertile range.
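Mainland's point can be demonstrated by simulation. In the hypothetical sketch below, there is no true difference between the toothpastes, yet treating teeth as independent units produces spurious "significant" results far more often than the nominal 5%; numpy and scipy are assumed to be available.

```python
# The wrong-sampling-unit error: caries tendency varies by child, so
# teeth within a mouth are correlated, and analyzing teeth as
# independent units badly understates the SE.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_children, teeth_per_child, alpha = 15, 20, 0.05

def false_positive_rate(unit, trials=2000):
    hits = 0
    for _ in range(trials):
        # No true treatment effect: both groups from the same process
        a = rng.normal(0, 1, n_children)[:, None] + rng.normal(0, 0.3, (n_children, teeth_per_child))
        b = rng.normal(0, 1, n_children)[:, None] + rng.normal(0, 0.3, (n_children, teeth_per_child))
        if unit == "tooth":
            p = stats.ttest_ind(a.ravel(), b.ravel()).pvalue      # wrong: 300 "units"
        else:
            p = stats.ttest_ind(a.mean(axis=1), b.mean(axis=1)).pvalue  # right: 15 units
        hits += p < alpha
    return hits / trials

print("false-positive rate, teeth as units:   ", false_positive_rate("tooth"))
print("false-positive rate, children as units:", false_positive_rate("child"))
```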

Violation of assumptions

The debate about independent sampling units illustrates that many assumptions underlie statistical tests. Studies of the medical research literature have found that statistical tests have been applied inappropriately to data because the assumptions underlying the tests were probably false.

Errors in analysis and spotting faked data

This group of problems is concerned with errors in the statistical analysis of data. A large number of these, including mistakes in computation, would be expected to be hidden from the reader because the raw data often are not reported. However, sometimes errors in analysis are evident even at the level of simple arithmetic computation. An example is furnished by a paper submitted to the Journal of Experimental Medicine by Summerlin, who later admitted to falsifying data.23 Summerlin performed transplants on six groups of animals; each group had 20 animals. Summerlin reported the percentage that were not rejected in each group. Because any nonrejection is a discrete event, the percentage had to be a multiple of five: 1/20 = 5%, 2/20 = 10%, etc. The percentages Summerlin recorded were 53, 58, 63, 48, and 67. An alert referee should have noticed that something was wrong with Summerlin's data.

In 1881, astronomer Simon Newcomb observed that the earlier pages of logarithm tables showed more wear than the later pages in the books and concluded that, for natural observations, the first significant number is more often the number 1 than it is 2; the number 2 is more frequent than 3; and so forth.24 This pattern was later confirmed by Benford, a physicist at General Electric, who analyzed some 20,229 observations; the first significant digit rule eventually became known as Benford's Law. The probability P that the first significant digit D1 equals a particular value d (where d = 1, 2, 3, … 9) is given by the following:

P(D1 = d) = log10(1 + 1/d)

Benford's Law was an empirical affair until 1991, when it was given a rigorous mathematical foundation by Georgia Tech mathematics professor Theodore Hill. Today it is widely used to identify fraudulent data (or unintentional errors) not only in naturally occurring data but also in financial data. Frey24 provides a history of Benford's Law, some applications, and even references where one can find MATLAB code to do calculations on Benford's Law.
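For readers who want to try the check themselves, here is a minimal sketch in R (my own illustration; the cited sources use MATLAB). It compares the first significant digits of a data set against the Benford probabilities:

benford <- log10(1 + 1/(1:9))                 # P(D1 = d) for d = 1, ..., 9
x <- rlnorm(5000, meanlog = 8, sdlog = 2.5)   # stand-in for "natural" data
first.digit <- as.integer(substr(formatC(x, format = "e"), 1, 1))
observed <- as.numeric(table(factor(first.digit, levels = 1:9))) / length(x)
round(cbind(digit = 1:9, Benford = benford, Observed = observed), 3)

Large departures of the observed frequencies from the Benford column are grounds for a closer look, not proof of fraud; many legitimate data sets (eg, bounded clinical measurements) are not expected to follow the law.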

A Final Warning and a Set of Rules

Statistics are powerful tools for estimating the effects of random error. However, conclusions based on statistical inference should not always be accepted at face value but, like any other technique, should be viewed critically. Statistical analysis is not sacred but based on empirical findings (such as distributions of data) and assumptions that are sometimes dubious. To quote Epstein: "The field of statistics bows to no master for ability to furnish subterfuge, confusion, and obscuration."25 The foregoing principles refer to only a small fraction of the possible problems that can be considered in statistical analysis. Indeed, despite the highly sophisticated types of statistical analysis that are now performed, there is ongoing debate on the foundations of statistical inference.26 It is not uncommon for statisticians to disagree. There are journals that deal solely with biostatistics, suggesting that many problems related to analyzing biologic data remain. Despite the sophistication found in some studies in dental research, there are numerous articles that use inappropriate or less-than-optimal statistical methods.27 Because it is not conceivable that every student will become an expert in biometrics, I offer the following rules:

1. Always apply the golden rule: Examine every sample for size, diversity, and randomness.
2. Ask if the data should be evaluated with statistical techniques. Statistical testing is sometimes absent when it should be present. Dental journals still publish papers that claim differences in treatments or effectiveness of treatments in the absence of statistical testing. Readers can either perform the statistical test themselves (see the sketch after this list) or simply disregard the paper because nothing has been proved. A brief, excellent guide to the selection of statistical tests and a readable, nontechnical explanation of the most common techniques has been published by Norman and Streiner.28
3. If a simple statistical test was used, determine whether it was applied appropriately.
4. If unusual statistical tests were applied, there are two choices:
   a. If the paper is important to you, look up the test in a statistics textbook. Textbooks such as those by Norman and Streiner28 and Zar29 give straightforward explanations of the conditions under which particular tests are applicable. This may give you more insight into the reliability of the data, or perhaps the authors' desperation to get a significant result.
   b. If the paper is only of marginal interest, assume that the authors chose the right test for the application and accept the result of the test as given, and make a note of your reservations in the event that the paper becomes more important to you.
5. For all papers, try to formulate alternative hypotheses that can also explain the results.
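As an example of rule 2, suppose a paper reports success/failure counts for two treatments but no test. A reader can check the claim in a few lines of R (the counts here are hypothetical, my own example):

counts <- matrix(c(34, 16,    # treatment A: 34 successes, 16 failures
                   22, 28),   # treatment B: 22 successes, 28 failures
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("A", "B"), c("Success", "Failure")))
chisq.test(counts)   # Pearson chi-square test (with Yates continuity correction)
fisher.test(counts)  # exact test, preferable when expected counts are small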

Now, Something Completely Different: The Bayesian Approach to Induction

The logic of statistical inference presented previously is called the Neyman-Pearson approach, but it also might be termed classical statistics. This standard approach has several logical problems.30,31

First, the standard approach evaluates evidence according to the intentions of the investigator. Investigators have to decide in advance what comparisons they are going to make, and whether they will be testing for greater-than or less-than differences (ie, a one-tailed test), or any differences at all (ie, a two-tailed test). Intuitively, it is difficult to understand how the intentions of the investigators bear on the assessment of the actual evidence, but the standard approach decrees that the investigator's intentions matter. Thus, one could argue that the standard approach is not truly objective because it incorporates not only the objective data but also the subjective intentions of the investigator.

A second problem is that the theory considers evidence that was not actually obtained. As illustrated in the example from chapter 10, in computing the probability of obtaining a certain number of successes for UBC dental graduates, we added the probabilities of the more extreme values as well. For example, for Dentist C, who had 13 successes, we added the probabilities for 1, 2, 3, 4, . . . 12, 13 successes. But these events—eg, 12 successes—were never observed for C; they just had a
possibility of being observed. Some statisticians argue that the probability of data that have not been observed is irrelevant in making inferences from an experiment based on actual observations.

A third problem is that researchers utilizing the standard approach end up testing the most untenable and outlandish null hypotheses. Researchers who are aware of differential effects have hopelessly inappropriate means of quantifying them; thus, they are reduced to implying that a result giving P < .000001 is more important than one for which P < .01.32

A fourth problem is that science should be cumulative. It would be an unproductive strategy to ignore the huge amount of available data and design experiments in which the null hypothesis is likely to be true. Instead, scientists try to perform experiments with high probabilities of revealing new information. As noted by Ross,32 the hypothesis of no effect commonly used in hypothesis testing is correct for improbable claims; however, it is not correct when investigators have a body of knowledge to draw from and a theory designed to predict not just that something will happen but what and how much.

The Bayesian approach to statistics is named after the Reverend Thomas Bayes, a nonconformist clergyman who proposed the basic ideas in 1763. Bayesian analysis has travelled a long and difficult road toward acceptance by the statistics community, a journey that has been chronicled in the book The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy by Sharon Bertsch McGrayne.33

In Bayes' method, we start with a subjectively determined prior probability for a hypothesis. After obtaining evidence that bears on the hypothesis, we calculate the likelihood of obtaining the observations if the hypothesis is true. Then we combine the subjective prior probability with the likelihood to obtain a posterior probability of the hypothesis being true. The net effect of the calculations is that the evidence alters your assessment of the probability of the hypothesis being true. Such an approach has intuitive attraction because it mimics the way rational people commonly learn and make decisions, as well as the way health professionals modify their decisions as they gather more information about their patients. As David Hume (1711–1776) put it in section X "Of Miracles" in his An Enquiry Concerning Human Understanding, "A wise man proportions his belief to the evidence."34 Our opinion should change as the evidence changes.

However, Bayes' theorem does not specify that one has to start with a purely subjective hypothesis or a
noninformative hypothesis; one can start with a probability based on some experience. In the instance of this example, the posterior probability calculated in the first cycle based on pure subjectivity and relative ignorance could be used as the starting prior probability for a second experiment. After successive iterations, one would expect the posterior probability to approach the truth. Thus, the Bayesian approach is cumulative and reflects the real world of science more accurately than the standard method of hypothesis testing. The main criticism of the Bayesian approach is that the best assessment of the first prior probability is subjective, while science is an objective business. However, theoretical work has shown that prior probabilities normally have relatively little influence on the corresponding posterior probabilities, and what influence they do have diminishes rapidly as evidence accumulates.

The Bayesian approach applied to clinical diagnosis has been illustrated in chapter 15 and is the application most familiar to medical and dental clinicians. However, the Bayesian approach has numerous other applications. In calculating probabilities, as in chapter 15, for presence of a disease, you may have noticed that discrete values were used. This simplifies the calculation. But in many situations, there might be a range of values that are possible and compatible with available data. For example, one might think that the probability of "heads" on a coin flip should be taken to be 0.5, and that repeated trials of, say, 100 flips would be centered on a distribution plotting probability density versus P at a value of P = .5 (ie, a 50/50 chance of heads or tails). But in the skilled hands of a professional gambler, a coin flip might not be fair, and some other value of P might be more appropriate.

To test this idea in a Bayesian fashion, one would first postulate a prior distribution based on one's current belief. Analogous to classical frequentist statistics, the distribution would be specified as a probability density plotted against the value of P. In such a curve, the area under the curve over a particular interval, for example, P between the values of 0.1 and 0.19, would yield the probability that P lies in that interval. Moreover, as usual for probability density plots, the total area under the curve must equal 1. Although in theory one could construct a completely subjective distribution by recording one's subjective probability for each interval of the probability scale, subsequent calculations are going to be more convenient if one describes the distribution by a well-characterized distribution. In Bayesian inference, problems of this type are described by the beta (β) distribution, which has some similarities to the binomial distribution that
would be used by classical frequentists for dealing with coin flip calculations. The equation for the β distribution, worked-out examples, and extended discussion are given by Pruzek,35 whose chapter has been praised by Harlow36 as a very clear description of Bayesian inference. In any case, after the construction of the prior distribution, the Bayesian approach involves the collection of new data. These new data would be used to generate a likelihood distribution by plugging the observed values into the equation for a β distribution, which would then be used to modify the prior distribution to produce a posterior distribution. Conveniently, the mathematics are such that the parameters of the posterior distribution are obtained simply by adding the corresponding parameters of the prior and the likelihood. So, similar to the diagnosis calculations, the prior probability has been modified by likelihood data to produce a distribution of posterior probabilities. The posterior distribution can then be explored to come to an understanding of the experimental data and its relationship to the prior, and thereby modify the subjective beliefs expressed in the initial prior. A good web-based tool for exploring these relationships can be found at bayesweb.com.37

The calculations underlying Bayesian inference in the simple coin flip model become much more complex when applied to more complex models and data. In those instances, one wants to calculate the probability of parameter values and model structure given the data. Moreover, it will be of interest to compare different models that are mathematical descriptions of probabilities. A particular difficulty is calculating the probability of obtaining the observed data, which involves summing across all possible parameter values weighted by the strength of belief in those parameter values. This calculation involves computing what are often difficult integrals, and Bayesian statisticians resort to numeric approximations. A favored approach is the random sampling method known as Markov chain Monte Carlo (MCMC). Kruschke has credited the development of MCMC methods with allowing Bayesian statistical models to gain practical use.38 John K. Kruschke's text Doing Bayesian Data Analysis38 provides examples of the calculations using the R programming language. The R language is widely used for performing Bayesian statistics, and abundant resources are available. Kruschke's text includes the R code for various examples, and there is also supplementary information on his website. Moreover, the text includes Bayesian analogs of the standard null hypothesis tests, ranging from the t test to multifactor analysis of variance and logistic regression. It is, however, evident that learning these procedures is not for the mathematically faint of heart, and a wise option for scientists and clinicians unfamiliar with the procedures, as with complicated classical statistical tests, may be to consult statisticians with appropriate expertise.

Despite the difficulties of calculation involved in the Bayesian approach, the additional effort can be rewarding because Bayesian statistics offers significant advantages for experimenters in many important aspects. Perhaps the foremost exponent of Bayesian techniques in biomedical research, Donald A. Berry, has outlined and exemplified the usefulness of the Bayesian approach in a lecture available online.39 Berry gives the top five reasons for the Bayesian approach: (1) online learning, so one can adapt a design while the experiment is in progress so that, for example, a more efficient use of patients can be obtained; (2) predictive probabilities (which, for example, enable investigators to terminate experiments ahead of schedule when a finding has been sufficiently demonstrated); (3) hierarchical modeling; (4) modeling more generally; and (5) decision analysis (of obvious importance in treatment planning). Bayesian methods have found application in dental research,40–42 and one hopes that the approach will become more popular as investigators and journal referees become more accustomed to it.
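For the coin flip example, the conjugate beta-binomial update described above can be carried out in a few lines of R. This is a minimal sketch of my own with hypothetical numbers, not an example from Kruschke or Pruzek; in this conjugate case, no MCMC is needed:

a <- 10; b <- 10          # prior Beta(10, 10): centered on P = .5, mildly confident
heads <- 63; tails <- 37  # hypothetical data: 100 flips of a suspect coin
a.post <- a + heads       # posterior parameters are simply the prior
b.post <- b + tails       #   parameters plus the observed counts
a.post / (a.post + b.post)            # posterior mean for P (about .61)
qbeta(c(.025, .975), a.post, b.post)  # 95% credible interval for P
1 - pbeta(.5, a.post, b.post)         # posterior probability that the coin favors heads

Rerunning the last three lines with a larger hypothetical sample shows the influence of the prior shrinking as evidence accumulates, exactly as described above.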

References

1. Carlyle T. Statistics. In: Carlyle T. Chartism Past and Present. London: Chapman and Hall, 1858.
2. Scheutz F, Andersen B, Wulff HR. What do dentists know about statistics? Scand J Dent Res 1988;96:281.
3. Huff D. How to Lie with Statistics. New York: Norton, 1954.
4. Beecher HK. Pain, placebos and physicians. Practitioner 1962;189:141.
5. Light RJ, Pillemer DB. Summing Up: The Science of Reviewing Research. Cambridge: Harvard University, 1984.
6. Keene HJ, Shklair IL, Anderson DM, Mickel GJ. Relationship of Streptococcus mutans biotypes to dental caries in Saudi Arabian naval men. J Dent Res 1977;56:356–361.
7. Bakan D. On Method. San Francisco: Jossey-Bass, 1967.
8. Cohen J. Statistical Power Analysis for the Behavioral Sciences. New York: Academic, 1977.
9. Dao TT, Lavigne GJ, Feine JS, Tanguay R, Lund JP. Power and sample size calculations for clinical trials of myofascial pain of jaw muscles. J Dent Res 1991;70:118.
10. van Belle G. Sample size. In: Statistical Rules of Thumb. New York: Wiley, 2002:19,29–51.
11. Murphy EA. A Companion to Medical Statistics. Baltimore: Johns Hopkins University, 1985.
12. Michael M, Boyce WT, Wilcox AJ. Biomedical Bestiary. Boston: Little Brown, 1984.
13. Sackett DL. Bias in analytic research. J Chronic Dis 1979;32:51.
14. Norwegian Multicenter Study Group. Timolol-induced reduction in mortality and reinfarction in patients surviving acute myocardial infarction. N Engl J Med 1981;304:801.
15. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. Survey of 71 "negative" trials. N Engl J Med 1978;299:690.
16. Antczak AA, Tang J, Chalmers TC. Quality assessment of randomized control trials in dental research. II. Results: Periodontal research. J Periodontal Res 1986;21:315.
17. Sterling TD. Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. J Am Stat Assoc 1959;54:30–34.
18. Blomqvist N. On the choice of computational unit in statistical analysis. J Clin Periodontol 1985;12:873.
19. Mainland D. Elementary Medical Statistics. Philadelphia: Saunders, 1963.
20. Fleiss JL, Kingman A. Statistical management of data in clinical research. Crit Rev Oral Biol Med 1990;1:55.
21. Hujoel PP, DeRouen TA. Determination and selection of the optimum number of sites and patients for clinical studies. J Dent Res 1992;71:1516.
22. Schor SS. How to evaluate medical research reports. Hosp Physician 1969;5:95–99.
23. Moore DS, Notz WI, Nester DK. Do the numbers make sense? In: Moore DS, Notz WI, Nester DK. Statistical Concepts and Controversies, ed 6. New York: Freeman, 2006:154–162.
24. Frey B. Statistics Hacks. Sebastopol: O'Reilly, 2006.
25. Epstein RA. The Theory of Gambling and Statistical Logic. New York: Academic, 1967.
26. Godambe VP, Sprott DA (eds). Foundations of Statistical Inference. Toronto: Holt, Rinehart and Winston, 1971.
27. Emrich LJ. Common problems with statistical aspects of periodontal research papers. J Periodontol 1990;61:206.
28. Norman GR, Streiner DL. PDQ Statistics. Toronto: Decker, 1986.
29. Zar JH. Biostatistical Analysis, ed 2. Englewood Cliffs: Prentice Hall, 1984.
30. Urbach P. Clinical trial and random error. New Sci 1987;116:52–55.
31. Berger JO, Berry DA. Statistical analysis and the illusion of objectivity. Am Sci 1988;76:159.
32. Ross J. Misuse of statistics in the social sciences. Nature 1985;318:514.
33. McGrayne SB. The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines and Emerged Triumphant from Two Centuries of Controversy. New Haven: Yale University, 2011.
34. Hume D. Of Miracles, with an Introduction by A. Flew. La Salle, IL: Open Court, 1985.
35. Pruzek RM. An introduction to Bayesian inference and its applications. In: Harlow LL, Mulaik SA, Steiger JH (eds). What If There Were No Significance Tests? Mahwah, NJ: Lawrence Erlbaum Associates, 1997:287–319.
36. Harlow LL. Significance testing introduction and overview. In: Harlow LL, Mulaik SA, Steiger J (eds). What If There Were No Significance Tests? Mahwah, NJ: Lawrence Erlbaum Associates, 1997:2–17.
37. BayesWeb. http://bayesweb.com. Accessed 11 June 2018.
38. Kruschke JK. Doing Bayesian Data Analysis. Amsterdam: Elsevier, 2011.
39. Berry DA. Bayesian and related methods. In: The Science of Small Clinical Trials Lecture Series. https://videocast.nih.gov/Summary.asp?Live=8497&bhcp=1. Accessed 11 June 2018.
40. Gilthorpe MS, Maddick IH, Petrie A. Introduction to Bayesian modeling in dental research. Community Dent Health 2000;17:218–221.
41. Thoden van Velzen SL, Duivenvoorden HJ, Schuurs AHB. Probabilities of success and failure in endodontic treatment: A Bayesian approach. Oral Surg Oral Med Oral Pathol 1981;52:85–90.
42. Hsiao CK, Chen PC, Kao WH. Bayesian random effects for interrater and test-retest reliability with nested clinical observations. J Clin Epidemiol 2014;64:808–814.


22 Judgment

"Systematic decisions based on a few explicable and defensible principles are superior to intuitive decisions—because they work better, because they are not subject to conscious or unconscious biases on the part of the decision maker, because they can be explicated and debated, and because their basis can be understood by those most affected by them."
ROBYN M. DAWES1

Clinical Versus Scientific Judgment

Clinical judgment can be thought of as a combination of knowledge, skills, and abilities based on personal experience and reading that allows a clinician to make rational diagnostic and therapeutic decisions even though the available information is incomplete and uncertain. Clinical judgment requires the ability to choose relevant data from a larger set, formulate an overall view of the clinical problem, develop a plan to evaluate the clinical problem, monitor and modify the plan as needed, and determine what additional resources are needed.2 In medicine, clinical judgment is developed by exposure to many patients with a wide variety of diseases and presentations, experience in treating patients and following their courses, reading the literature, evaluating the validity of decisions by trial and error, and interactions with peers and superiors by discussion as well as by attendance at lectures, seminars, and professional meetings.3

One problem with the acquisition of clinical judgment is that clinical experience is often gained in a haphazard manner because it depends on the problems of the patients who show up for treatment. Moreover, typically, a clinician must make judgments rapidly. On the other hand, scientists are concerned with exact and systematic investigation. Academic science is a social institution devoted to the construction of a rational consensus of opinion over the widest possible field4—a time-consuming process. Thus, scientists can keep their minds open if the information is not definitive, whereas clinicians have to act. However, both groups face some common problems in forming their judgments.

Critical Thinking As Applied to Scientific Publications

There are standards related to scientific writing and argumentation used to evaluate critical thinking, including the assessment of scientific publications.


Standards for scientific writing

• Clarity. Authors must make their points understandable to the reader. A good way to do this is to present the material so that it matches reader expectations; Gopen and Swan5 provide an excellent introduction to this topic for writers of science in their article "The Science of Scientific Writing."
• Accuracy. The criterion of accuracy applies not only to measurements but also to references to the literature. For example, have the cited authors' views been accurately portrayed in the discussion section, or have they been subtly shifted so as to align with the views of the writers?
• Precision. Scientific writing should be precise, with no ambiguity and little wiggle room. The fallacy of equivocation should be avoided. Scientific writers should not emulate the famous oracles at Delphi, whose predictions were worded so that they could be construed to be true no matter what events came to pass. Similarly, scientific measurements should be sufficiently precise so they can be used to distinguish between hypotheses.
• Relevance. Both the introduction to and the discussion of scientific papers should include only materials relevant to the conclusions. Mere parading of knowledge by incorporating extensive references irrelevant to the problem under investigation wastes readers' time and tests their patience.
• Depth. Scientific papers should explore the topic to a depth appropriate to the standards of the field in which they are published. Thus, a clinical study on the effects of chlorhexidine published in a dental journal would include discussion of the clinical measurements but would not include molecular orbital calculations on the molecule. The article's depth will be determined largely by the standards of the journal in which it is published.
• Breadth. The authors should consider the points of view of other investigators in the topic area, as well as relevant insights from related areas.
• Logic. Readers who labored through the early chapters of this book need no more preaching on the importance of logic. Here, I will merely mention that writers on critical thinking find logic indispensable.
• Significance. Because readers invest their time in understanding a scientific publication, it is only fair that they receive something of value in exchange. Thus, papers should present meaningful information that can impact the research, understanding, or clinical practice of the reader. Assessing significance is a difficult task, particularly soon after publication; however, over time, impact can be assessed by tools such as citation analysis.

Putting It All Together: Argumentation Maps

At the end of the day, investigators have to consolidate their diverse set of observations, literature references, and arguments into cogent conclusions. Often, more than one set of observations contribute to the conclusion. This situation is exemplified in Fig 22-1, which shows an argumentation map that my former postdoctoral fellow Douglas Hamilton and I constructed in preparation for writing a paper subsequently published in Biomaterials.6 The paper concerned the response of osteoblasts to the specific topographic features of the surface on which they were cultured. Among the responses examined were cell signaling cascades—inherently complex pathways, whereby cells transmit signals received from the environment into appropriate actions (such as activating specific genes). The experiments tend to be similarly complex, involving multiple procedures such as immunostaining, Western blots, and inhibitors of various pathways. Despite the complexity, the results must be clear to referees and editors. An argumentation map provides a way of clarifying matters for the authors, but it may also be used to understand the arguments of others.

Argumentation maps were developed by Horn7 as an aid in teaching philosophy and resolving or clarifying real-world debates. Related approaches are mind mapping and concept mapping, which is part of a learning movement called constructivism. Developed by Novak and Gowin8 in the 1970s, concept mapping visualizes relationships between different concepts.

[Fig 22-1 An argument map for a conclusion in a cell-signaling study. Various types of evidence support the conclusion, and the nature of the support is indicated on the linking arrows. Caveats to the conclusion and possible rebuttals may also be mapped. DES, discontinuous edged surfaces; ERK, extracellular signal–regulated kinase; FAK, focal adhesion kinase; SOF, standards of the field.]

Not all of the features of argumentation maps are used in Fig 22-1, but the main feature employed is that a claim (ie, a conclusion) is placed in what is called a focus box. In this case, the focus box is the paper's conclusion: "Substratum-induced topographic activation is mediated by the FAK-ERK complex." Surrounding the focus box are the results of specific experiments (eg, "P-ERK levels increased on grooves and DES"), which I will call data boxes. The finding in the "P-ERK" data box supports the conclusion by providing evidence for the involvement of ERK. That support is indicated by a link (in effect, an arrow connecting the boxes, which includes the relationship between them). In this case, the link is labeled "Evidence for role of ERK." (In fact, more than one link is labeled this way, as is not uncommon in science, because there are multiple ways of demonstrating a relationship.)

A second group of links are labeled caveats, or warnings of possible problems with the link. The box connected to the "P-ERK" data box by the caveat link is inscribed "True for all times?" This box reminded Hamilton and me that a referee might want more data at different times. If that request had been made, we would have had little choice but to comply. Another caveat-linked box, inscribed "Possibility that as…," is linked to the data box "Inhibitor studies" by a link labeled rebuttal. In other words, we thought the inhibitor studies refuted the possible criticism made in the "Possibility…" box. The rebuttal-linked box "SOF" constitutes another means of rebuttal, invoking the current standards of the field (SOF). In other words, were a referee to criticize on the basis of the "Not quantified" box, we could argue that most investigators do not quantify immunostaining (in terms of intensity of the fluorescence).

The SOF argument is one that might or might not work. You will recall from chapter 1 that the philosopher Fisher advocated the importance of having realistic standards. Normally referees are realistic because, like others, they do not want to face unreasonable standards when they submit their own papers for publication. Indeed, one sign of personal hostility to which scientists are acutely sensitive is unrealistic demands made by a referee or editor. Observing the influence principle of liking (discussed in chapter 4), alert scientists will go to some trouble to avoid having their work evaluated by a hostile person. Thus, scientists often try to surmise who is the source of the comments on their reviewed papers and grants.

Another kind of box, not shown in the diagram, is an "information missing" box, indicating that some information normally vital to establishing a conclusion is not currently available. How do you know what you do not know? The key to answering that question is to examine publications on similar topics in the same journal and determine what is normally required, although it must be conceded that identifying such requirements is not always easy.

Preparation of an argumentation map simplifies the writing of a paper. For this particular map, we could write, "there are several lines of evidence supporting the view that substratum-induced…," then go around the supporting evidence boxes sequentially. We would still have decisions to make; for example, should we discuss possible weaknesses in advance, or let the referees find them out for themselves? An author has an obligation to provide a balanced discussion; for example, the author would have to cite and explain any evidence in the literature contradicting the conclusion. The author also has an obligation to be reasonably concise—no one wants a discussion that examines minor points in mind-numbing detail.

Another use of argumentation maps is to assess the evidence for a conclusion in the paper. Sometimes in the dental literature one will find a conclusion supported by no evidence whatsoever in the paper. Some of these phantom conclusions may be platitudes relating to proper technique or widely accepted views. Other conclusions might be poorly supported. These are perhaps the most worrying because if the paper is accepted, they may undergo what I call "conclusion creep" in the author's subsequent papers, in which the weak sister conclusion transforms itself into a Goliath (albeit one ably brought down by the single stone of Fisher's assertability question, discussed in chapter 1). By constructing an argumentation map that makes the support for a conclusion explicit, the reader is also able to consider the links more closely, look for missing information boxes, and think of other interpretations of the data.

Because of the difficulty of making complex arguments understandable to readers, authors should consider hiring professional editors or medical writers to help them present their arguments in a convincing fashion. After all, the hiring of statisticians to deal with data analysis is commonly accepted, yet statistical considerations are only one part of the total argument. The success of a study in terms of its eventual publication and impact might be well served by investing resources in rhetorical presentation, because readers will be hesitant to cite what they do not understand.

Balanced Judgments

A study that contains errors of logic, method, or design is not necessarily useless or unimportant. James Lind studied a group of 12 British sailors with scurvy and divided them into subgroups of 2 that received the following dietary supplements: cider, elixir vitriol, vinegar, sea water, two oranges and one lemon, or an electuary (a medicinal paste) the bigness of a nutmeg.9 By current standards of experiment design, Lind's findings would be considered rubbish. There was no random allocation of subjects; the treatments were not applied blind; there was no
placebo; only one dose of each agent was used; and there was no statistical analysis. Yet, Lind's findings revealed a way to prevent scurvy. Similarly, in June 1884, Pasteur's anti-rabies vaccine was investigated by a committee of scientists appointed by the French government. For Pasteur's definitive test, he had used only two vaccinated dogs, two untreated dogs, and two rabbits.

Consider Ninio's analysis10 of how a strict reviewer might evaluate the signal papers of molecular biology:

The double helix? Our reviewer has found two major faults: The authors wrongly assume the bases to be in the keto form, and they conveniently ignore the fact that the A/T and G/C ratios differ significantly from unity by at least 10%. Furthermore, the authors have no original data to present that would substantiate their speculations. The genetic code? Dr Matthaei and Nirenberg's work will not be credible as long as the authors persist in using nonbiological templates, at nonphysiological Mg2+ concentration in a system that is five orders of magnitude slower than the cell's apparatus. . . . Had all the referees done the kind of job they are asked to do, molecular biology would hardly exist.

These examples illustrate the principle that assessing scientific papers is not a black or white proposition; defects in one aspect can sometimes be compensated by strengths in another. Thus, while Watson and Crick may not have presented much original data in their Nature article,11 the overall brilliance of their proposal won them a Nobel Prize. Even a humble case report can contain the germ of an observation that could prove valuable for future study. The existence of errors or weaknesses in a study does not mean that the conclusions are necessarily wrong. We should not make the judgment error humorously known as "throwing out the baby with the bathwater." The problem of assessing a paper, then, is a balancing act, similar to that practiced by Benjamin Franklin (see following chapter), in which the relative merits of the pros and cons of the evidence for the claim, as well as other factors such as originality and quality of execution, are considered and balanced to form an overall judgment.

Judgments Under Uncertainty, Heuristics, and Cognitive Biases

In some instances, sound judgments about the scientific merits of a proposal might be of practical importance. For example, the field of resin composites is changing rapidly, and there is believed to be a growth market for esthetic dentistry that uses these formulations. When new formulations become available, dentists must decide whether to adopt them. If a dentist adopts a new product and it fails, he or she may be stuck with the responsibility of replacing failed restorations. If the dentist does not change to a newer formulation that subsequently performs better than others, the dentist may lose patients to another practitioner who offers this better service. A dentist does not necessarily have the luxury of waiting; at a recent conference, it was stated that improved composite formulations are being prepared every 6 months. In short, dentists, like other clinicians, are forced to make judgments about the quality of scientific information under uncertain conditions.

While the preceding chapters of this book provide a guide to evaluating scientific information, there is still the mental act of forming a final judgment, which involves balancing the various components of the evaluation process. The process whereby people form judgments under uncertain conditions has been a topic of considerable investigation by psychologists.12 In medicine, there has also been considerable attention to cognitive errors and strategies to minimize them in medical diagnoses.13 Several principles have been developed that explain how people make judgments and errors in judgment. In brief, people often use mental shortcuts or rules of thumb, called heuristics, to form judgments and make decisions. These shortcuts can work well under some circumstances, reinforcing users' confidence in their value. Not surprisingly, to save time these shortcuts focus on one or a few aspects of a complex judgment problem and ignore others. Also unsurprisingly, these focused shortcuts can result in errors, which are called cognitive biases. Wikipedia lists (as of 4/13/2018) 110 biases under the category decision-making, belief, and behavioral biases alone. There are also social biases and memory errors. Simply put, there are a lot of ways to make bad decisions. Indeed, in the current political climate, participants in the 2017 grant evaluation process of the Canadian Institutes of Health Research were specifically warned about implicit bias, the implicit stereotype that is the unconscious attribution of particular qualities to a member of a certain social group, in particular females, and were given information to combat this perceived problem. Following are some of the characteristics of human judgment that are particularly applicable in evaluating scientific work.

Affective error

We value too highly information that fulfills our desires. Physicians care about their patients and want a good outcome, a condition that can lead them to underinvestigate problems so that they favor diagnoses that appear to indicate good outcomes for their patients, ie, an outcome bias.13

Assimilation bias

This bias results from the tendencies in model building and revision noted below. Risen and Gilovich14 point out that even those who consciously fight off the confirmation bias and acquire balanced information suffer from the inclination to interpret ambiguous evidence in a manner that supports an initial hypothesis or supposition. They provide the example of election debates, where partisans of both sides come away claiming that their candidate won because they attend to different features of the debate and quite literally see a different contest.3

Belief in the law of small numbers

Experimental evidence shows that even scientists are rather insensitive to sample size and are willing to draw recklessly strong inferences about populations from knowledge about even a small number of cases.15 Clinical seminars exemplify this principle when clinicians propose treatments based on their experience with few patients.

Commission bias

Commission bias is the tendency toward action rather than inaction. Groopman13 states that this error is more likely to occur with a clinician who is overconfident or desperate and gives in to the urge to "do something." It can also arise from pressure from a patient, for example, a patient who is suffering from a viral disease and demands treatment with an antibacterial antibiotic. A worry faced by our faculty when fees were drastically increased was that students who graduated with large amounts of debt from student loans would be tempted and motivated to "do something" to relieve their financial worries even if the patient did not actually require treatment.

Confirmation bias

Risen and Gilovich14 class this bias as a fallacy of evidence search and interpretation. It refers to the inclination to recruit and give weight to evidence that is consistent with the hypothesis in question, rather than search for inconsistent evidence that could falsify the hypothesis. This bias can be seen in daily life when left-wing viewers prefer PBS or CBC, whereas those with a right-wing bent prefer Fox News. In both instances, viewers will be exposed to information weighted to their world view and feel comfortable with it. In dental research, there are often proponents for given techniques or products, and even in good faith, these proponents can give a distorted view of their preferred options by not fairly considering the merits of competing procedures or products because their acquisition of information is tainted by confirmation bias.

Greater impact of concrete versus abstract information

People are not very good at handling probabilistic data. Information on things such as prevalence of a disease will not have the same impact as one good illustrated example. This has long been recognized as the case; in 1927, Bertrand Russell wrote, "popular induction depends upon the emotional interest of the instances, not upon their number."4 In experimental tests, where sheer number of instances is pitted against emotional interest, emotional interest has in every case carried the day.4 I have noticed that when clinical scientists give lectures to dental students, the students remember the clinical slides illustrating a single patient at one point of time who had received a particular treatment, but have little ability to recall statistical data that represent many patients and better indicate treatment effectiveness.

Model building and revision

Most inferences in everyday life rely on models that are imprecise, incomplete, or incorrect.

1. People commonly overpredict from highly uncertain models.16 This could be called the horse-race fallacy, and I am among that overconfident group who become more confident the horse will win after they make a bet on it.
2. It appears easier to assimilate a new fact within an existing causal model than to revise the model in light of this fact. In experimental studies, subjects are reluctant to revise a rich and coherent model, even if the model is very uncertain; instead, they easily use an existing model to explain new facts, however unexpected.10 Thus, theories—like the focal infection hypothesis from dentistry—can linger on even when they are highly improbable.

Problems of causal reasoning

There is an irresistible tendency to perceive sequences of events in terms of causal relations, even when the perceiver is fully aware that the relation between the events is incidental and that imputed causality is illusory.17 This statement merely affirms that the post hoc fallacy is common, as manufacturers of over-the-counter medicines are profitably aware. In attributing causes to certain events, people usually assess the degree to which observed behaviors or outcomes occur in the presence of, but fail to occur in the absence of, each causal candidate under consideration.18 Thus, they seem to apply Mill's canons of induction—particularly the method of agreement. Moreover, in the case of single observations, the assessment strategy involves a discounting principle whereby the observer discounts the role of any causal candidate in explaining an event, so that other plausible causes or determinants can be identified.19 Thus, people adopt the inductively correct policy of considering plausible alternatives. However, in applying these rules, they may make certain errors. For example, there is some evidence to suggest that when people are called on to assess their role in a given sequence of events, they may give themselves more credit for success and less blame for failure than do independent observers.15 This concept would seem intuitively obvious to anyone who has worked in large institutions or who has seen case presentations.

Use of heuristics

In assessing probabilities and other estimation tasks, people typically use certain heuristics—that is, exploratory problem-solving techniques and rules of thumb—to reduce complex tasks to simpler operations.12 These rough and ready ways of thinking can lead to problems. Three examples follow:

1. Adjustment and anchoring. In many instances, people make estimates by starting from an initial value that is adjusted to yield the final answer. Typically, adjustments made to the initial value are insufficient. Thus, to get a good price on the sale of a house, it might be best to price it significantly higher than the market value. This will anchor the value in buyers' minds, and typically buyers will not adjust sufficiently when they make their offers. Anchoring can distort clinical diagnoses when a clinician doesn't consider multiple possibilities but rather latches on to a single (perhaps initial) one, and that impression is strengthened by confirmation bias as the clinician attends only to the evidence favoring the anchored possibility.20 Another problem related to consideration of multiple strategies is the search for strategies itself. A practice termed search satisficing is the tendency to stop searching for a diagnosis once something, which is not necessarily the optimal answer, has been found.21
2. Availability heuristic. Objects or events are judged as frequent or probable, or infrequent or improbable, depending on the readiness with which they come to mind. Here is an example from the cognitive science literature: Is it more likely that a word selected at random from an English text starts with r (eg, rich) or that r is the third letter (eg, car)? People approach this problem by recalling words that begin with r and then recalling words that have r in the third position. Because it is much easier to search for words by their first letter than their third letter, most people judge words that begin with a given consonant to be more numerous than words in which the same consonant appears in the third position. They do this even for consonants such as r or k, which are more frequent in the third position than in the first.12 An example from medicine: The availability heuristic can come into play when there have been numerous similar cases that have been dealt with by the clinician recently. Inner city hospitals, for example, may treat large numbers of alcohol abusers, and a patient with uncontrolled shaking might be classified as having delirium tremens even though there are numerous other possibilities.20 Errors of this type, termed attribution errors, tend to occur when patients fit a negative stereotype. Similarly, clinicians' judgments can be affected by the "last bad experience," the vividness of which causes it to be prominently available in the clinician's memory. Potchen22 believes that what most influenced clinical choices was "the last bad experience."
3. Representativeness heuristic. An object is assigned to one category rather than to another insofar as its principal features resemble the category. An example used in cognitive science: Two professors of chemistry have a friend who also is a professor and is shy, small in stature, and likes to write poetry. What is this friend's area of study—Asian studies or chemistry? Most people answer Asian studies, even though at the university there are four times as many professors of chemistry as there are professors of Asian studies, and two chemistry professors are much more likely to know another chemist than a member of any other department. People ignore these probabilities, because the description would appear to match a preconception of an Asian studies professor. They assign too much weight to the representativeness heuristic. An example from medicine: Clinical thinking is often guided by prototypes, and finding one of the characteristics of the prototype in the patient may lead clinicians to fail to consider possibilities that contradict the prototype.

Additional problems with clinical judgment

An understanding of clinical judgment is sometimes required in the evaluation of clinical research. Arkes17 described four impediments to accurate clinical judgment in medicine:

1. Inability to assess covariation accurately. Many people overemphasize the percentage of cases in which both symptom and outcome are present. The cases in which the symptom is absent and the outcome is present are not accorded their full importance.
2. Preconceived notions or expectancies.
3. Lack of awareness of factors that influence judgment.
4. Overconfidence by physicians about the accuracy of their judgment, a condition brought about by the following:
   a. Placebo or Hawthorne effect. Receiving any treatment or attention is often beneficial.

   b. Gathering of selective information that tends to support the favored hypothesis.
   c. Selectively disregarding evidence that contradicts the present judgment.
   d. Hindsight bias. Many data sets contain observations that may be used to support many different interpretations. When a physician knows that a specific outcome has occurred, the data are analyzed and put together to support that outcome. I call this the Howie Meeker bias. Howie, a former professional hockey player, was the analyst on the CBC hockey telecasts. Howie would show the audience replays of the goals and would comment on what the defensive player should have done better. Of course, when you select goals as an example, the choice of the defender always turned out poorly, but there was no guarantee that the solution proposed by Howie would have had a better outcome. I heard through a friend of the players that some criticized players used to resent Howie's analysis and nicknamed him "Howie Perfect." Hindsight is always 20/20 and can be very irritating. Howie had the last laugh, however, winning the Foster Hewitt Memorial Award for "Excellence in Hockey Broadcasting." Hindsight bias can be rewarding.

In reviewing behavioral decision theory and clinical information processing, Dowie and Elstein18 emphasize the following problems:

1. Failure to retrieve correct hypothesis from memory. Aging professors prefer to label this as a senior moment, rather than the F word, failure. Given enough time, they are sure they will retrieve the correct hypothesis.
2. Pursuit of exotic categories at the expense of more probable diseases. There is a medical adage that "When you hear hoofbeats, think of horses, not zebras"—ie, common events are found commonly. (However, fear of falling into this trap can lead to what has been called zebra retreat to describe a doctor's shying away from a rare diagnosis.23)
3. Misinterpretation of data, whereby the data best remembered are those that fit the hypotheses generated.
4. Poor use of probabilistic information, such as the neglect of Bayes' theorem in revising probabilities, and confusion of the diagnostic value of a test with its predictive value. Base rates are often
neglected in favor of individualizing information. This is an example of what has been called the vividness problem.24 In solving problems or making decisions, people are most likely to employ the most readily accessible facts. One factor that strongly affects accessibility is the vividness of the information—and individual cases abound with vivid, concrete information. My coauthor Ben Balevi treated a patient who had a history of recurrent facial swelling that occurred over 10 years and had been treated with antibiotics by five different dentists. Ben was able to diagnose the patient's condition as Melkersson-Rosenthal syndrome and published a paper on the topic.25 However, even though Ben knew the probability of the syndrome was low, he included it regularly in his differential diagnosis because the case had sensitized him to the possibility.
5. Denial of uncertainty. There are limitations to the certainty that can be offered by even modern medical science and further limitations of the ability of clinicians to master all of it. But there is often a need to act promptly. To resolve this dilemma, clinicians tend to deny uncertainty and gravitate to a position of certainty so that action is possible.

Belief perseverance in the face of empirical challenges

Beliefs from narrow personal impressions to broader social theories seem remarkably resilient in the face of empirical challenges that seem logically devastating.19 To some extent, this may be the result of the ability of the human mind to accommodate new data into an existing accepted model. Ross and Anderson19 state that professional scientists are not immune to this offense:

Again and again one sees contending factions that are involved in scholarly disputes, whether they involve the origins of the universe, the line of hominid ascent, or the existence of ego-defensive attribution biases, draw support for their divergent views from the same corpus of findings.

These cognitive biases affect scientists as well as lesser mortals, and we can expect that authors of scientific articles, suffering from confirmation and assimilation biases, will resort to mental gymnastics to justify their findings and to defend their theories.

From the preceding lists, we can conclude that clinicians have many of the same problems and tendencies with judgment as the general population. A formal approach—evidence-based decision making in dentistry—is given in the following chapter.

Forming Scientific Judgments: The Problem of Contradictory Evidence

Differences among studies

The science of reviewing research and combining the results of various studies, a procedure termed meta-analysis, has been summarized by Light and Pillemer.26 Some of the topics in this book give at least a partial explanation for the disagreements in the findings of different studies. Table 22-1, adapted from DePaola,27 summarizes review findings on the results of dentifrice use to control dental caries. All of the agents tested produced conspicuous effects in some trials but little or no effect in others. At first glance, this is depressing; scientific findings should be reproducible, as noted earlier. There are many reasons why exact replication of results does not occur.

Table 22-1 | Review of the use of dentifrices to control dental caries

Agent | No. of negative findings | No. of positive findings | Range of caries reduction reported (%)
Urea and ammonium dentifrices | 4 | 7 | 25–90
Enzymatic dentifrices | 3 | 3 | 43–53
Antibiotic dentifrices | 4 | 2 | 26–56

Data from DePaola.27


Random fluctuations

In a single trial consisting of five flips of a coin, the odds of getting five heads in a row with a fair coin are (1/2)^5, or 1 chance in 32. This is less than the commonly used cutoff of 5%, or 1 in 20. If the five-flip trial were repeated 100 times, it is likely that five heads would occur at least once. This unlikely event of five consecutive heads could appear on the first trial. If we only did a single trial and obtained these results, we might conclude erroneously that the coin was not fair. A similar effect could have occurred in some of the caries studies. Because DePaola reviewed 45 studies, it is likely that in some of the experiments, contrary to the findings, the treatments did not really have any effect, but instead a statistical type I error occurred.
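A quick check of the arithmetic, assuming the 100 trials are independent (a couple of lines of R, my own illustration):

p5 <- (1/2)^5      # probability of five heads in one five-flip trial: 1/32, about .031
1 - (1 - p5)^100   # probability of at least one all-heads trial in 100 trials: about .96

So a result that is "significant" at the 1-in-32 level is almost guaranteed to appear somewhere in a large enough collection of independent trials.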

Differences in test conditions

A more common problem is that all of the studies vary somewhat in the method of treatment or the population studied. For studies of dental caries, relevant differences might include location, age range, oral hygiene habits, diet, previous exposure to fluoride, and the diagnostic criteria used by the examiners. In illustrating this point, DePaola27 used data to show that two nonfluoridated communities 12 miles apart varied greatly in their dental characteristics. Moreover, he points out that if independent tests of the same agent were performed in each of these locales, it would be hard to imagine that the strikingly different dental characteristics of the subjects in each study would not cause the clinical results to vary. Similarly, the age distribution of the subjects would also be important. For example, unlike in 8- to 10-year-old subjects, the measurement of caries activity in 6- to 7-year-old subjects is dominated by the first permanent molar, a tooth that is notoriously difficult to protect. The size of the effect of a protective agent would be expected to change depending on the relative numbers of children in each age group.

However, despite these explanations, the discrepancies between various studies are disturbing. The fallibility of a single study demonstrating statistical significance has been emphasized by prominent pioneers in the field of statistics. Tukey28 stated the following:

The modern test of significance before which so many editors of psychological journals are reported to bow down owes more to R. A. Fisher than to any other man. Yet Sir Ronald's standard of firm knowledge was not one very extremely significant result but rather to repeatedly get results significant at 5%.

Fisher29 phrased his criterion for belief as follows:

In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.

Readers of clinical journals need to keep these thoughts in mind because, not uncommonly, the results of a single trial are reported, perhaps because of the expense of large-scale clinical trials. Yet even some of the most widely accepted dogmas of dentistry fall short of Fisher's standard. It becomes obvious that continuing work is required to put dentistry on a sound scientific footing. The reproducibility (or lack thereof) of various dental measurements and diagnostic tests is part of the problem. Differences in test conditions and methods of measurement are more likely causes of discrepancies between studies than is type I error.
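Fisher's point can be given a simple numerical form. Suppose, as an illustrative assumption (the figure is not from the text), that a real effect is studied with 80% power, ie, each independent trial has an 80% chance of reaching significance at the 5% level. A short sketch shows how quickly the chance of uniformly significant results erodes:

```python
# Illustration of Fisher's standard: even for a real effect, a study
# with 80% power (an assumed value) misses significance 20% of the
# time, so repeated significant results are strong evidence.
power = 0.80
for k in (1, 2, 3, 5):
    print(f"P(all {k} independent trials significant) = {power ** k:.2f}")
# -> 0.80, 0.64, 0.51, 0.33
```

With 80% power, three independent studies all reach p < .05 only about half the time, so isolated failures to replicate are expected even for genuine effects.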

Discrepancies: Opportunities for research

There is a long-standing tradition of discrepancies between observations providing opportunities for research. For example, the inert gases (argon, xenon, etc) were discovered after it was noted that the molecular weight of nitrogen prepared from air (which contained inert gases) differed from that found for nitrogen prepared from ammonia. To return to the example of the efficacy of dentifrices, the discrepancies between studies present opportunities for research. One might want to investigate the possibility that positive or negative findings for enzymatic dentifrices correlate with fluoridation in the test community, or with other characteristics such as the age of subjects, or with attributes not presently known to affect dental caries.

Heuristics, biases, System 1 and System 2, and some caveats

Kahneman was awarded the Nobel Prize in Economics for his work, often done in collaboration with Amos Tversky, on judgment under uncertainty. There has been widespread acceptance of the work, and indeed it has spawned the whole field of behavioral economics. But there have also been criticisms. In particular, Kahneman,30 as well as others briefly reviewed by Risen and Gilovich,14 has devised models of how the brain works, featuring two processing systems, System 1 (fast, automatic, effortless) and System 2 (slow, deliberate, lazy), to explain the findings of a vast body of literature.

Very briefly, System 1 can be thought of as an associative machine, whereby various elements of thought are connected so that each supports and strengthens the others to form a coherent mental picture of the world, resulting in the pleasant state of cognitive ease. System 1 enables the execution of automatic tasks such as driving a car on an empty road, reading words on large billboards, or detecting hostility in a voice. System 1 is also important for survival; it takes over in emergencies and assigns total priority to self-protective actions. Among many other properties, System 1 is biased toward belief and confirmation. It invents causes, intentions, and intuitions. It generates impressions, feelings, and inclinations. From the preceding, you can imagine that System 1 underlies some of the inaccurate heuristics described earlier and can generally be held responsible for our acceptance of informal logical fallacies and incorrect assessments of causation.

However, when the picture presented by System 1 is not coherent, a state of cognitive strain is induced that calls upon System 2 to do the heavy lifting. System 2 can construct thoughts in an orderly series of steps and implement rules of thought. System 2 makes deliberate choices and forms explicit beliefs. A key feature of System 2 is that all its operations require attention and are disrupted when attention is disrupted. Sadly, humans have a limited budget of attention, and when we spend beyond our budget we fail. One can do several things at once, but only if they are easy and undemanding. System 2 is responsible for the continuous monitoring of our own behavior; it is in charge of self-control.

However, some of the phenomena studied by social psychologists, in the words of Collins,31 "haven't held up through the replications crisis" of the last few years. It appears that more research will be required to determine the precise conditions under which the findings can be applied. Nevertheless, it is clear that biases can interfere with effective decision making, and debiasing is an area of active investigation (see, for example, Croskerry32 or the entry for debiasing on Wikipedia).

It appears that susceptibility to some biases can be reduced by training in statistics or, more generally, by "engaging System 2."14 Similarly, the need to search for and consider alternative hypotheses, to reflect on decisions, and to take time (when possible) to consider issues has been urged, and these are among the core concepts of critical thinking. Croskerry,32 for example, has devised cognitive debiasing strategies to reduce diagnostic error. These include considering alternatives; taking a reflective approach to problem solving and the thinking process; decreasing reliance on memory; specific training on flaws in thinking, including instruction in fundamental rules of probability, distinguishing correlation from causation, and basic Bayesian probability theory; minimizing time pressures; and other strategies. Another approach is to develop better heuristics that supply some of the benefits of currently used heuristics but are less problematic. Gigerenzer, Todd, and colleagues at the Center for Adaptive Behavior and Cognition are developing simple, fast, and frugal heuristics that "make us smart."33 Groopman34 has suggested that heuristics constitute the foundation of all mature medical thinking. Fast and frugal heuristics will be explored briefly in the next chapter.

Meta-analysis and its problems

It has been found useful in many instances to rigorously combine the results of different studies so as to incorporate all of the available information. Clearly, this approach of meta-analysis is most feasible when investigators use similar techniques on similar populations and look at similar outcomes. However, a major problem is that research studies on the same topic can vary in many ways, and the reporting of data can be incomplete. For example, Creugers and Van 't Hof35 examined 60 publications presenting clinical data. Their aims were to assess an overall survival ratio of three- and four-unit resin-bonded partial dentures and to explore relationships between success factors, including type of retention, cementation material, preparation of the abutment teeth, and location of the denture. However, many of the publications did not give sufficient information to address these questions, so only 14 of the studies were selected for further analysis. Sufficient information was available to analyze type of retention and location of the denture, but weighted multiple-regression analysis did not reveal a significant effect of either. This does not necessarily indicate that the factors were unimportant; it may simply reflect the high variability in patient selection, tooth preparation, luting resin cements, and operator experience. The most useful information that came out of the study was that the overall survival curve showed an almost linear slope.
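To make the mechanics concrete, the sketch below pools survival proportions from several studies by inverse-variance (fixed-effect) weighting, one standard meta-analytic calculation. The study numbers are hypothetical and are not taken from Creugers and Van 't Hof:

```python
# Fixed-effect pooling of survival proportions (hypothetical data).
studies = [(45, 50), (70, 80), (28, 40), (90, 110)]  # (survivors, total)

weighted_sum = total_weight = 0.0
for survivors, n in studies:
    p = survivors / n
    variance = p * (1 - p) / n   # binomial variance of the proportion
    weight = 1 / variance        # inverse-variance weight
    weighted_sum += weight * p
    total_weight += weight

print(f"pooled survival proportion: {weighted_sum / total_weight:.3f}")
```

Larger and less variable studies receive more weight, which is why incomplete reporting (missing sample sizes or outcome counts) forces studies to be dropped from such an analysis.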


Using Others' Judgments: Citation Analysis As a Means of Literature Evaluation

No one can be an expert on everything; few people are experts on even one topic. Thus, to some extent, everyone must rely on other people's opinions. Opinions are not facts but judgments or beliefs based on grounds short of proof, and therefore they can be biased. It is possible that experts with different biases will disagree. A traditional approach to this problem is to collect informed opinions from as many experts as possible. The traditional source that enables us to collect scientific opinion on published papers is the Science Citation Index (SCI). As noted in chapter 5, a number of other sources of this information, such as Google Scholar and Scopus, are now available. The fundamental question we can answer quickly with the SCI is where and by whom a particular paper has been cited. For example, if we were interested in the model system for studying gingivitis introduced by Löe et al36 in the Journal of Periodontology, we could find by means of the SCI that it had been cited 1,939 times as of July 2019. We would also discover who had cited this paper and where they had cited it. There are two ways to use this information:

1. We could read these papers and see exactly what the authors said about Löe et al's paper. Thus, we would be canvassing nearly 2,000 papers for an opinion. For a highly cited paper like Löe et al's, this approach would be thorough, but it would require an inordinate amount of work.
2. A second use of the SCI would be simply to count the number of authors who cited Löe et al. This can be instructive because most authors cite publications in a positive sense. Thus, the very fact that a paper (such as Löe et al's) has been cited frequently probably indicates that the paper has received wide acceptance.

However, most papers do not acquire such a large number of citations, and to interpret the numbers at all requires some background information.

Number of references per paper

Garfield37 has stated that the number of references per paper is an important factor in citation impact (defined as the number of citations per paper). Papers published in mathematics journals cite an average of eight references, whereas papers published in biochemistry journals cite more; the Journal of Molecular Biology, for example, cites around 29 references per paper. Clearly, this gives the biochemists an edge in collecting citations simply by virtue of the habits of their field. Studies published in the Journal of Periodontology and the Journal of Periodontal Research have around 20 references per article, which is close to what Price38 defines as the normal range of references per paper. On average, papers in science journals cite about 15 references, of which about 12 refer to other journals rather than to books.39 The core dentistry journals average 17 references per article, slightly higher than the average for all scientific journals (14.5).

Age distributions of references

Calculation of Price's index (PI), the proportion of references dated within the past 5 years, offers another insight into research. The benchmarks for such a study are provided by the data of Price,39 who examined 162 journals from various subjects and dates and found that the median PI is 32%, with a lower quartile of 21% and an upper quartile of 42%. Hard sciences fall in the upper quartile and are called research-front areas. Physics and biochemistry are two examples of this group, and such fields are thought to be undergoing rapid progress. Does dental research move quickly or slowly? The PI of research in periodontology varies; in some years it falls in the rapid-progress group, but in other years it falls in the range occupied by the normal group of sciences, that is, those that move at a somewhat slower pace. Another way of looking at this aspect of literature usage is to calculate the average number of citations to all papers as a function of time after publication. Figure 22-6 presents this information for the Journal of Periodontology and the Journal of Periodontal Research between 1966 and 1971.40

Fig 22-6 Average number of citations to all papers versus time after publication in the Journal of Periodontology and the Journal of Periodontal Research from 1966 to 1971. (Reprinted from Brunette et al40 with permission.) [Figure not reproduced; both panels plot average number of citations, 0 to 1.6, against years after publication, 0 to 10.]

It is evident that, as for other biomedical literature, the maximum frequency of citation of papers in both journals occurred 2 to 3 years after publication. This finding has a practical consequence: it means that our chances of locating citations to an article of interest are small in the first year after its publication. This is not surprising when we consider the mechanics of publication. Even if the reader of an article were writing a paper and decided to incorporate a reference to the newly published paper in his or her manuscript, a considerable delay in the appearance of the citation would result from the time necessary for the second article to be published. The length of that delay depends on factors such as the promptness of the referees, the diligence of the editorial staff, and the amount of revision required. For example, in 1989 it typically took about 8 to 9 months after submission for a paper to be published in the Journal of Dental Research. The processing of papers has sped up in recent years because most refereeing is now done over the Internet, and by 2016 the time from submission to a decision on publication had been reduced to 11.4 days. Nevertheless, it remains true that the effectiveness of citation analysis in evaluating an article in the first year after publication is limited.
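Price's index is simple to compute from a paper's reference list. The sketch below applies the definition used above; the reference years are hypothetical:

```python
def prices_index(reference_years, publication_year):
    """Percentage of references dated within the 5 years preceding
    publication (Price's index)."""
    recent = sum(1 for y in reference_years if publication_year - y <= 5)
    return 100 * recent / len(reference_years)

# Hypothetical example: a 2019 paper citing 10 references.
refs = [2018, 2017, 2016, 2015, 2015, 2010, 2008, 2001, 1995, 1965]
print(f"Price's index: {prices_index(refs, 2019):.0f}%")  # -> 50%
```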

Distribution of citations among individual papers

All papers are not created equal; the number of times individual papers are cited varies widely. Data from an early study that I conducted with Simon and Reimers (Fig 22-7) show the distribution of citations to articles that appeared in the Journal of Periodontology from 1966 to 1971 in the first 5 years after their publication.40 The figure shows that there is wide variation in the number of times articles published in the same journal are cited. The distributions are highly asymmetric, and the majority of papers have low citation counts.

Fig 22-7 Distribution of the percentage of articles having a given number of citations in the first 5 years after publication in the Journal of Periodontology from 1966 to 1971. [Figure not reproduced; the histogram plots frequency (%), 0 to 20, against number of citations, 0 to >20.]

The shape of Fig 22-7 is roughly what would be expected from earlier work on the distribution of citations among individual papers. Price39 found that the percentage of papers cited a given number of times was proportional to n^-a, where n is the number of citations and the exponent a lies between 2.5 and 3. Power-law distributions of this type imply that a small proportion of the items under examination are responsible for a large proportion of the desired products. Figure 22-7 demonstrates that different papers in the Journal of Periodontology have a widely varying impact on subsequent research. The majority of papers have very little impact, and a significant proportion (18%) were never cited in the journals covered by the SCI (which include most of the major journals of dental research) in the 15 years following publication. There may be other, nonresearch uses of this material; hence, it is not fair to say that uncited articles are useless. Nonetheless, it cannot be denied that their influence on subsequent research appears to be small. One can only speculate on the reasons so many papers are ineffective in stimulating further research. First, on the whole, many authors seem to choose topics that are inherently uninteresting both to other research workers and, after publication, to the authors themselves. Second, many scientific publications appear to contain weaknesses. For the present purpose, it should be noted that the citation record of a paper can be used as a screen to sort papers of varying impact.
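The concentration implied by such a power law can be illustrated by simulation. The sketch below draws citation counts from a discrete distribution proportional to n^-2.5 (an exponent within Price's reported range; the sample sizes are arbitrary) and reports the share of all citations collected by the top 10% of papers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete power law P(n) ~ n**-2.5 over n = 1..1000 (illustrative).
n = np.arange(1, 1001)
p = n ** -2.5
p = p / p.sum()

citations = np.sort(rng.choice(n, size=10_000, p=p))
top_decile_share = citations[-1_000:].sum() / citations.sum()
print(f"Top 10% of papers collect {top_decile_share:.0%} of citations")
```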


Role of citation analysis in the evaluation of individual scientific papers


Acceptability and reproducibility of published results

Many scientific articles undergo some sort of refereeing process. In theory, at least, the referee may be held responsible for checking on matters such as the validity of the stated conclusions, the clarity of presentation, and the suitability of an article for publication in a particular journal. However, the referee cannot be expected to reproduce the data that constitute the results section of a scientific paper. Consequently, the publication of such results represents an act of faith in the authors, in that the editors accept that the given results were obtained under the given conditions. However, subsequent workers may be unable to reproduce those results for a number of reasons. In addition, others may be reluctant to accept the authors' interpretation of the results. Such differences in either fact or opinion can often be discerned with citation analysis (CA), by which we look up the authors who have cited a particular paper to see whether they have done so in an affirmative or a negative manner. Therefore, CA can serve as an instrument (albeit not a very sensitive one) for checking on the reproducibility and acceptability of published results.

Collective judgment

One strong argument for the use of CA in the evaluation of scientific work is the collective nature of the citation index. Because it covers the vast majority of the significant journals of science, the SCI in effect canvasses the writings, and thus the judgments, of the majority of scientific investigators. This large number of judges means that CA may be the fairest means of evaluating a paper, because the effect of individual biases becomes less pronounced. As noted earlier, academic science has been defined as the social institution devoted to the construction of a rational, consensual opinion over the widest possible field.41 CA provides science with the most efficient means of demonstrating that a given piece of work has been absorbed into the scientific framework of a given subject.

Relationship of CA to peer judgment

A common means of evaluating scientific work is through the peer-review mechanism. In this system, a committee of experts (typically, investigators in the given subject) review and assess the quality of an investigator's work in that subject. A major problem of the peer-review system is that it has the appearance of a "buddy system" (ie, you approve my work and I will approve yours).42 Citation frequency (the number of times a given article has been cited) is objective and is therefore an improvement on review by only a few peers. Nevertheless, a number of objections to the method have been raised. Overall, it appears that ratings made by CA corroborate the decisions made by other methods, including peer review.43

Perhaps the ultimate accolade awarded by peer judgment is the Nobel Prize. Sher and Garfield44 found that there are 30 times as many citations per average Nobel Prize winner as there are per average cited author. In a sample of 13 Nobel Prize winners, 11 were in the top 0.185% of the citation frequency file. Although part of the high citation rate was because the Nobel Prize winners produced more papers than the average scientist, the average number of citations per cited paper was still roughly twice that of the average scientist.

Objections to CA

The objections to CA are largely based on uncertainty about the reasons people cite papers. The role and significance of citations in scientific communication have been reviewed by Cronin.45 Perhaps the most frequent objection is that papers that have been proved wrong might be quoted frequently. However, this is not the case; negative citations are rare. Another objection is that scientists could inflate the value of their work by citing themselves frequently. Because this strategy is exceedingly obvious, it would likely be self-defeating. In any case, in the periodontal literature, self-citation comprises a significantly greater proportion of the citations of infrequently cited papers than of frequently cited papers; it is hard to pull yourself up by the bootstraps. A third objection, that citation frequency depends on the popularity of the specialty, has been answered by the observation that, though more papers are published in popular fields, there are also many more papers available to quote. Thus, the chance of a particular paper being quoted will be about the same in a popular field as in any other, provided other factors such as the number of references per article remain constant. Nevertheless, it is true that there are differences in citation frequency between fields. As noted earlier, the most critical factor in CA is the average number of references cited per paper. Moreover, it has been found that scientific literature grows at different rates from topic to topic and during different periods. Therefore, caution is required in making comparisons between work done in different fields or at different times. Another problem with CA is that although each citation entails recognition of another's work, little is known about the actual motivations of authors when they cite papers.

Summary of the role of SCI in the evaluation of individual papers

Although some of the problems encountered in CA can and do weaken its strength, they are merely the noise in the measurement procedure. A major advantage of CA is that it is quantitative and, with further research, may be improved and refined. Questions (such as the impact of self-citations) can be isolated and answered. Garfield and Cawkell46 have stated that statistical routines have been developed to account for derogatory references, self-citations, multiple authorship, the stature of the journal in which the article appeared, and unusual technique papers. CA has its pitfalls when applied to individual papers, particularly recently published ones, for which the amount of information available in the SCI is small. Nevertheless, it reflects the real world of science, not a fantasyland where all scientists' work is equally valuable.46 In the real world of personal economics, the number of citations correlates more strongly with increases in salary than does either professional experience or number of publications.47 When evaluating particular papers, readers should combine CA with the other techniques described in this book. Its main value for our purposes is, first, that it enables readers to find out other scientists' opinions of cited papers; second, it serves to make readers suspicious of the significance of uncited papers, a suspicion that will be justified most of the time.

Influencing Judgment: Lower Forms of Rhetorical Life

Sound conclusions are the result of combining valid logic with reliable data. However, for the conclusions to be accepted, an investigator must persuade the listeners (if giving a seminar) or readers (if writing a paper). Because the social interaction of scientists is said to be organized skepticism, the job of listeners and readers, particularly referees, is to criticize the presentation. A sound piece of work might remain unrecognized because of poor presentation or inadequate response to criticisms. Conversely, poor science might be masked by superb showmanship or rhetorical legerdemain. In a satirical paper, Kline48 termed some of the techniques used in these lower forms of rhetorical life factifuging (ie, putting facts to flight), which he describes as the most successful technique for defending the intuitively correct position against contradictory data. I have condensed Kline's list to three categories: distraction, denigration, and terminal inexactitude.

Distraction

The point of distraction is to so amuse, bore, or confuse the audience that they forget the original question, which might have had a damaging effect on the investigator's arguments. One technique is joking. A speaker was asked whether contact guidance (a phenomenon of cell biology) had any effect on his results. Not knowing any cell biology, the speaker replied, "Your point reminds me of the story of an elderly preacher, a lifelong bachelor, who married a widow. On their honeymoon night, his new bride was anxiously waiting in bed and was surprised when her husband dropped to his knees beside the bed. 'What are you doing?' she asked. 'My dear,' replied the preacher, 'I'm praying for guidance.' 'Don't worry. I'll provide the guidance,' his experienced bride said. 'You pray for strength.'" The audience remembered the joke but forgot the point of the question.

Individualizing involves use of the vividness phenomenon. The case is described in excruciating detail in the hope that the concrete effect of the case will overwhelm any abstract statistical information that might contradict the investigator's theories.

Visual-aiding involves use of materials like cartoons, photographs of the speaker's university or home, or spectacular color graphics, which are done so well that the audience concludes the science was done equally well.

Word games come in many forms. One example is renaming. Logicians have long been aware that to avoid contradictions, you must make distinctions. If you are caught in a contradiction, it can be advantageous to assign new meanings to old words or to invent new terms entirely. These can be justified by a dictionary of word origins, the use of which enables you to show that "everything is either itself or its opposite."48 Foreign phrasing involves using phrases in Greek, Latin, French, German, laboratory jargon, or other languages to impress and confuse an audience, many of whom will be hesitant to ask what a phrase means.

Apodicting is the art of issuing sententious statements or principles that the speaker feels are self-evident and hence not in need of justification by experimental data.48 When confronted with a contradictory example, a speaker might reply, "Well, that's the exception that proves the rule." This is obviously a silly statement; the reason for testing a hypothesis and finding an exception is to disprove the rule, yet some people believe the adage. Moreover, a show-off might point out that the word prove in the adage is used in the old Scottish sense of test. At this point, the audience will feel erudite, but the thrust of the original attack will have been lost.

When faced with questions for which they have no answers, many speakers prefer to answer some other question rather than admitting "I don't know"; this is a strategic diversion. The technique is particularly effective when the question is vaguely worded, so that the speaker can answer it any way he or she chooses. It can also be used by speakers who are not speaking in their native tongues, because they have a legitimate excuse for misunderstanding questions. Another popular technique is to say, "I'll deal with that point later in the seminar," and never return to it. Both methods save a speaker from embarrassment.

Denigration

The object of denigration is to drag your opponent down to or below your own level. It is used when the attacker cannot disprove the conclusions but feels compelled to question their importance.

Damning with faint praise might be called the English disease. A North American referee might call a given piece of work "excellent," whereas an English referee would refer to the same work as "good," reserving the word excellent for contributions that would lead directly to the Nobel Prize. Faint praise can be used with other techniques. An especially devastating combination is lukewarm praise followed by a but or an and together with a suggestion for future improvement, as in, "Thanks for telling us about your interesting work, and I look forward to seeing it being more rigorously pursued in the future." The first part of the sentence lowers the speaker's guard, while the second part delivers the knockout punch.

As a general rule, critics do not have to concern themselves with feasibility or practicality; they can insist on unrealistic standards. For a critic using a holier-than-thou approach, there is never enough of anything: there should be a larger sample size, more controls, more time points, more sophisticated analysis, and more modern techniques. Until these alterations are made, critics can imply that the work does not meet appropriate standards. The cost of maintaining such standards is cheap if you do not have to bear the expense. In the real world, there are always limitations of time, money, or type of patient, and any sensible evaluation of a paper has to consider what is feasible.

The contrarian approach makes use of the fact that any study involves making decisions, and each decision means that some other option must be foregone. The instinctive contrarian points out the merits of the other choice. For example, a clinical study can be done with volunteers or nonvolunteers. If a study used volunteers, the contrarian would argue that everyone knows volunteers are a highly unusual group who are not representative of the general population. If the same study had used nonvolunteers who had somehow been forced or selected to participate, the contrarian would argue that this selection procedure obviously biased the results. Kline48 notes that subjects have to be something, and whatever it is that they are or are not can be used as the basis for claiming probable bias. The contrarian focuses on the weaknesses of the choice. Like the holier-than-thou critic, the contrarian ignores the information provided by a study on the grounds that it was not the best possible study.

Old-hatting denigrates the originality of a study by claiming the research has been done before. This is most effective if the information being cited was published in an obscure journal or a hard-to-find thesis and cannot easily be found. Old-hatters might assert that it was they who did the work and that they are therefore glad to see it confirmed. It has been said that scientists are divided between splitters, who make distinctions, and lumpers, who group everything together. The old-hatters are often great lumpers who show a remarkable ability to find parallels in disparate pieces of work.

Nothing-butting (or "just an example of . . .") is a form of one-upmanship in which research findings are dismissed as nothing but an example of some more fundamental theory. Nothing-butting denigrates the importance of a problem or finding by viewing it as just a subset of other, more general problems or findings. Thus, periodontal disease might be regarded as nothing but a minor infection of the mouth. The nothing-butter ignores all specifics of a situation and concentrates on the big picture. For periodontal disease, the nothing-butter with training in immunology might ignore the vast economic importance of the problem, as well as the technical difficulties in treating individual teeth, and say that periodontal disease is just an example of an inappropriate immune response.

The devil's advocate probes research for weaknesses, including effects or considerations that were outside the original objectives of the study. If a given treatment were applied and produced an effect, the devil's advocate might want to know why the effect was not larger, or at least as large as that of some other treatment; whether the treatment had any side effects; or whether, if the treatment was a drug, it works at other dosages. In short, like the holier-than-thou critic, the devil's advocate shows that the research is not ideal; however, the devil's advocate makes no pretense of attaining such heights and may even admit, "I'm just being the devil's advocate here." The devil's advocate approach can be used constructively by individuals who wish to critique their own work and develop alternative ways of presenting their data. Moreover, it can be used as a tool prior to presentation to develop statements or responses to possible criticisms. Nevertheless, the audience might become less impressed with a speaker after his or her research has been subjected to this type of criticism.

Terminal inexactitude

The speaker or writer controls the flow of information to an audience and can be dishonest by not giving all available relevant information, or can even provide erroneous information. The extent of this problem is unknown, but one major growth area in science is writing about scientific fraud. Occasionally, graduate students find themselves in a position where they fear doing one experiment too many, ie, the experiment that disproves their thesis. After years of hard work, it would be tempting for students who do an experiment disproving their thesis to place the results in a file drawer. For grant-hungry or pretenure professors who want to fatten their curriculum vitae, a more effective strategy might be to publish two papers: the first based on the preliminary results and the second a reexamination of the first. Terminal inexactitudes carry risks, but some investigators might agree with the comedian who, on considering Lincoln's adage "You can't fool all of the people all of the time, but you can fool all of the people some of the time, and some of the people all of the time," concluded, "And that's good enough odds for me."


References

1. Dawes RM. You can't systematize human judgment: Dyslexia. In: Shweder RA (ed). Fallible Judgment in Behavioral Research. San Francisco: Jossey-Bass, 1980:67–98.
2. Spilker B. Guide to the Clinical Interpretation of Data. New York: Raven, 1986:12–18.
3. Hastorf AH, Cantril H. They saw a game: A case study. J Abnorm Psychol 1954;49:129–134.
4. Russell B. Philosophy. New York: WW Norton, 1927.
5. Gopen GD, Swan JA. The science of scientific writing. Am Sci 1990;78:550–558.
6. Hamilton DW, Brunette DM. The effect of substratum topography on osteoblast adhesion mediated signal transduction and phosphorylation. Biomaterials 2007;28:1806–1819.
7. Horn RE. Teaching philosophy with argumentation maps. Newsletter of the American Philosophical Association, November 2000.
8. Novak JD, Gowin DB. Learning How to Learn. Cambridge: Cambridge University, 1996.
9. Lloyd C, Trotter T, Blaine G, Lind J. The Health of Seamen. London: Navy Records Society, 1965:2–24.
10. Ninio J. The ideology of scientific evaluation. Trends Biochem Sci 1981;6:8.
11. Watson JD, Crick FHC. A structure for deoxyribose nucleic acid. Nature 1953;171:737–738.
12. Tversky A, Kahneman D. Judgment under uncertainty: Heuristics and biases. Science 1974;185:1124–1131.
13. Groopman J. How Doctors Think. Boston: Houghton Mifflin, 2008:169.
14. Risen J, Gilovich T. Informal logical fallacies. In: Sternberg RJ, Roediger HL, Halpern DF (eds). Critical Thinking in Psychology. Cambridge: Cambridge University, 2007:110–130.
15. Nisbett RE, Borgida E, Crandall R, Rago H. Popular induction: Information is not necessarily informative. In: Kahneman D, Slovic P, Tversky A (eds). Judgment Under Uncertainty: Heuristics and Biases. Cambridge: Cambridge University, 1982:102.
16. Tversky A, Kahneman D. Causal schemas in judgments under uncertainty. In: Kahneman D, Slovic P, Tversky A (eds). Judgment Under Uncertainty: Heuristics and Biases. Cambridge: Cambridge University, 1982:117.
17. Arkes HR. Impediments to accurate clinical judgment and possible ways to minimize their impact. J Consult Clin Psychol 1981;49:323.
18. Dowie J, Elstein A. Professional Judgment: A Reader in Clinical Decision Making. Cambridge: Cambridge University, 1988:18–20.
19. Ross L, Anderson CA. Shortcomings in the attribution process: On the origins and maintenance of erroneous social assessment. In: Kahneman D, Slovic P, Tversky A (eds). Judgment Under Uncertainty: Heuristics and Biases. Cambridge: Cambridge University, 1982:129.
20. Groopman J. How Doctors Think. Boston: Houghton Mifflin, 2008:65.
21. Groopman J. How Doctors Think. Boston: Houghton Mifflin, 2008:169.
22. Potchen EJ. Measuring observer performance in chest radiology: Some experiences. J Am Coll Radiol 2006;3:423–432.
23. Stewart R. Retreat! It's a zebra… Medical Education Experts website. https://mededexperts.com.au/2018/01/05/retreat-its-a-zebra/. Published 5 January 2018. Accessed 1 May 2019.
24. Stanovich KE. How to Think Straight About Psychology. New York: HarperCollins, 1992:59–63.
25. Balevi B. Melkersson-Rosenthal syndrome: Review of the literature and case report of a 10-year misdiagnosis. Quintessence Int 1997;28:265–269.
26. Light RJ, Pillemer DB. Summing Up: The Science of Reviewing Research. Cambridge: Harvard University, 1984.
27. DePaola PF. The interpretation of findings in clinical caries trials. ASDC J Dent Child 1974;41:11.
28. Tukey JW. Analyzing data: Sanctification or detective work? Am Psychol 1969;24:83–91.
29. Fisher RA. The Design of Experiments, ed 8. New York: Hafner, 1966:14.
30. Kahneman D. Thinking, Fast and Slow. New York: Farrar, Straus and Giroux, 2011.
31. Collins J. Re-reading Kahneman's Thinking, Fast and Slow. https://jasoncollins.blog/2016/06/29/re-reading-kahnemans-thinking-fast-and-slow/. Accessed 27 August 2018.
32. Croskerry P. Diagnostic failure: A cognitive and affective approach. In: Henriksen K, Battles JB, Marks ES, et al (eds). Advances in Patient Safety: From Research to Implementation. Vol 2: Concepts and Methodology. Rockville: Agency for Healthcare Research and Quality, 2005.
33. Gigerenzer G, Todd PM. Simple Heuristics That Make Us Smart. Oxford: Oxford University, 1999.
34. Groopman J. How Doctors Think. Boston: Houghton Mifflin, 2008:36.
35. Creugers NH, Van 't Hof MA. An analysis of clinical studies on resin-bonded bridges. J Dent Res 1991;70:146–149.
36. Löe H, Theilade E, Jensen SB. Experimental gingivitis in man. J Periodontol 1965;36:177–187.
37. Garfield E. The significant journals of science. Nature 1976;264:609–615.
38. Price DJ. Networks of scientific papers. Science 1965;149:510–515.
39. Price DJ. Citation measures of hard science, soft science, technology, and nonscience. In: Nelson CE, Pollock DK (eds). Communication Among Scientists and Engineers. Lexington: Heath, 1970:1–22.
40. Brunette DM, Simon MJ, Reimers MA. Citation records of papers published in the Journal of Periodontology and the Journal of Periodontal Research. J Periodontal Res 1978;13:487–497.
41. Ziman J. An Introduction to Science Studies: The Philosophical and Social Aspects of Science and Technology. Cambridge: Cambridge University, 1984.
42. Symington JW, Kramer TR. Does peer review work? Am Sci 1977;65:17–20.
43. Narin F. Evaluative Bibliometrics: The Use of Publication and Citation Analysis in the Evaluation of Scientific Activity. Cherry Hill: Computer Horizons, 1976.
44. Sher IH, Garfield E. New tools for improving and evaluating the effectiveness of research. In: Yovits MC, Gilford DM, Wilcox RH, Staveley E, Lemer HD (eds). Research Program Effectiveness: Proceedings of the Conference Sponsored by the Office of Naval Research, Washington, DC, July 27–29, 1965. New York: Gordon and Breach, 1966:135–142.
45. Cronin B. The Citation Process. London: Taylor Graham, 1984.
46. Garfield E, Cawkell AE. Citation analysis studies. Science 1975;189:397.
47. Garfield E. Can researchers bank on citation analysis? Curr Contents 1988;31:3.
48. Kline NS. Factifuging. Lancet 1962;1:1396–1399.


23 Introduction to Clinical Decision Making
Ben Balevi, DDS, MSc
Donald Maxwell Brunette, PhD

"[Clinical practice] is a science of uncertainty and an art of probability." Sir William Osler1

Critical Thinking and Decision Making

Critical thinkers would popularly be described as careful thinkers. They are people who consider important issues and formulate questions about them clearly. Critical thinkers do not jump to conclusions or, to put it another way, rush from perceptions to conclusions. Often when a solution to some problem is put forward, critical thinkers tend to say, "It's not that simple." Critical thinkers gather and assess relevant information and reflect on the process of their thinking itself, such as the assumptions and rules that they apply. Moreover, they test their conclusions against relevant criteria and consider the implications and practical consequences of their conclusions. Clearly, critical thinking involves work and the expenditure of time, and in a world of busy cognitive misers, it might be wondered whether many important problems or political issues receive careful critical attention.

Clinical decision making addresses important problems in people's lives and has enormous economic implications, such as whether a particular diagnostic test should be covered by insurance or government programs. It has attracted considerable thought both in the professional research literature and in books addressed to the laity, such as Jerome Groopman's How Doctors Think.2 As noted in the previous chapter, the Nobel Prize-winning studies of Tversky and Kahneman on judgment under uncertainty showed the existence of cognitive biases, arising from commonly used heuristics, that undermine rational decision making. In contrast, Gigerenzer, Todd, and colleagues in the Adaptive Behavior and Cognition (ABC) group3 have promoted the use of fast and frugal heuristics (FFH) for making decisions, which reflect how real minds make decisions under the constraints of limited time and knowledge. Heuristic is a term that has been used in various ways in science and mathematics, and the ABC group considers FFH to be means of making inferences about unknown aspects of real-world environments. These heuristics employ a minimum of time, information, and computation to make adaptive choices in real environments. Time and information, as well as the capacity for complex computations, are real constraints in medical and dental practice, and it is evident that if successful FFH could be developed, they would be useful.

In FFH, the decision maker develops, through the course of life, a collection of cognitive strategies to solve commonly faced problems. They are simple and context specific; the decision made does not necessarily consider the full set of principles required for a rational decision (ie, rational choice theory) but reflects only a few criteria that are contextually important (ecological rationality).4 Breiman et al5 provide a powerful example of how such heuristics can be successful. At the University of California San Diego Medical Center, some 19 variables are measured on heart attack patients, but only three are employed to immediately classify a patient as high risk. The three criteria require only yes/no answers, and no elaborate calculations need to be done; the answers to the three questions suffice to class the patient as high risk or not. Such a simple and fast procedure might be considered to contravene the dictates of critical thinking outlined above, but the ABC group3 has developed heuristics that have proved accurate in a number of real-world scenarios. Indeed, it appears that, in retrospect, some famous and insightful scientists have used similar heuristics in their decision making.
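In code, such a heuristic is just a short chain of yes/no questions. The sketch below follows the spirit of the Breiman et al example; the variable names and cutoffs are illustrative assumptions, not the actual protocol of the UCSD Medical Center:

```python
def classify_high_risk(min_systolic_bp_24h, age, sinus_tachycardia):
    """Fast and frugal triage sketch: three yes/no questions, asked in
    order, with no arithmetic beyond simple comparisons."""
    if min_systolic_bp_24h <= 91:        # question 1 (assumed cutoff)
        return True
    if age <= 62.5:                      # question 2 (assumed cutoff)
        return False
    return sinus_tachycardia             # question 3

print(classify_high_risk(min_systolic_bp_24h=110, age=70,
                         sinus_tachycardia=True))  # -> True
```

Note that each question can settle the classification on its own; no weighting or combination of the 19 measured variables is required.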

Charles Darwin's decision to marry: One-reason decision making

Gigerenzer and Todd6 reviewed the process undertaken by Charles Darwin as he wrestled with the momentous decision of whether or not to marry his cousin Emma Wedgwood. Darwin drew up two columns headed by the words "Marry" and "Not Marry" and listed the pros in the "Marry" column and the cons under the "Not Marry" column. Darwin was honest, even painfully honest, when he filled in these columns. Under "Marry" he noted, among other reasons, "constant companion, (friend in old age) who will feel interested in one, object to be beloved and played with—better than a dog anyhow," whereas under the "Not Marry" column he listed such things as "Not forced to visit relatives, and to bend in every trifle—to have the expenses and anxiety of children—perhaps quarrelling."

Darwin faced a tough problem: How could he estimate the probability of happiness in the face of such a diversity of influences? In the end, Gigerenzer and Todd6 concluded that Darwin employed one-reason decision making, and his reason was having a constant companion: "a nice soft wife on a sofa" appeared a much more attractive alternative than "living all one's day solitarily in smoky, dirty London House." A critical thinker or a modern-day statistics-based prognosticator might well wonder how such a limited heuristic worked out. It seems that it turned out well; the couple was, by all accounts, devoted to each other. Emma participated in experiments on earthworms, played music, and, when Charles's eyes were tired, read to him. They produced ten children, but one ignored element in his decision making was problematic. Darwin had studied stock breeding closely and knew that inbreeding, such as between two first cousins, could lead to defective progeny. Three of Darwin's children died in childhood, and it has been speculated that Darwin's relentless promotion of the usefulness of studies in consanguinity might be explained by his ever-present awareness that he himself had married his first cousin.7 Although Darwin worried about the anxiety engendered by producing children, they proved to be useful in his scientific enterprise both as specimens and as investigators. He meticulously recorded the development of his first child, William Erasmus, for almost 2 years.7 Moreover, his surviving seven children grew up as research assistants, collecting data, specimens, and observations. Two of them became professional scientists, and his youngest son cofounded a scientific instrument company. No idea ventured by his valuable research-assistant children was dismissed out of hand.8 A one-reason heuristic worked for Darwin: even though he entertained many uncertainties and potential problems in his deliberations, none of them could be determined with sufficient accuracy or probability to overwhelm the influence of the "one good reason" heuristic, and in the long term some of them turned out to be benefits rather than drawbacks of marriage.


Benjamin Franklin's moral algebra

Benjamin Franklin was another two-column man, but with Franklin the two columns were labeled "Pro" and "Con." Franklin, by his own account, was not a man to jump to conclusions. Over a period of 3 to 4 days, he would fill the columns with points pro and con concerning the measure under consideration. He then assigned relative weights to the strengths of the points and computed a balance: if he assigned a weight of two to a pro point, it could be balanced by two con points each assessed a weight of one. After considering the points and weights of the entries in the two columns, Franklin could calculate whether the evidence on balance favored the pro or the con side of the argument3 (a minimal computational sketch of this moral algebra appears at the end of this section).

There is no doubt that Franklin was and is admired as a stellar example of a rational critical thinker. It would, however, be hard to imagine modern-day patients waiting for days while a clinician ruminated over a decision. Moreover, the delay in treatment caused by the time taken to make a decision might adversely affect the clinical outcome. A further problem with the Franklin method is that it lacks sensitivity; only a rough balance can be calculated, and while that may be sufficient for measures that differ greatly in support, it would not be useful for identifying the preferred alternative in closely balanced problems. Another issue is the time allowed for consideration of the measure; 3 to 4 days would generally be long enough to consider most problems, but would it be sufficient to comprehensively explore the background literature on an important dental or medical problem? As Marvell complained in his poem "To His Coy Mistress":

Had we but world enough, and time,
This coyness, Lady, were no crime.
We would sit down and think which way
To walk and pass our long love's day.

Some sort of stopping rule is required, whereby we quit gathering information and get down to action. Thus, modern evidence-based dental decision makers have procedures to search the literature comprehensively and obtain data on the probabilities and utilities of treatments (to be discussed separately).

When I first joined the faculty, it was my habit to have lunch with two members of the Periodontics division: Tim Gould, whom I co-supervised during his PhD program at the University of Toronto, and Alf Ogilvie, a senior faculty member who headed the division and had directed the postgraduate periodontics program at the University of Washington. Alf was widely respected as a superb clinician throughout the periodontics community. One day Tim met me at our usual location and said, while chuckling, "Alf won't be joining us today: the Dental Hygiene Department requested a consult for a 'snap' prognosis on a tooth." We both chuckled, as we knew that Alf was not the type of person who delivered snap judgments. "And," Tim continued, "Alf sat down and is currently examining every tooth in the patient's mouth. No lunch for him, no lunch for the student, no lunch for the instructor who requested the consult." Now, it was not necessarily inappropriate to ask an expert for a snap decision; one of the characteristics of experts is that they can often arrive at correct decisions quickly on account of their ability to recognize patterns. But it can happen that they arrive at their decision too quickly and do not consider all the possibilities. In fact, Groopman advises patients to ask their doctors questions that slow specialist clinicians down and make them consider other options.2 In any case, I am sure that the student had an exceptional educational experience as Alf examined the patient carefully and explained his reasoning, balancing the current situation, the likely prognosis with no treatment, and the likely interactions that would result were the tooth to be extracted. Alf was practicing the Benjamin Franklin approach of considering the pros and cons of the available options, but the process takes time, in this case lunch time. No doubt a wiser, if hungrier, student emerged from that clinical session.
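As promised above, here is a minimal computational sketch of Franklin's moral algebra. The items and weights are hypothetical and are used only to show the bookkeeping:

```python
# Franklin's moral algebra: weight each pro and con, then compare totals.
pros = {"constant companion": 2, "financial stability": 1}
cons = {"loss of freedom": 1, "expense of children": 1}

balance = sum(pros.values()) - sum(cons.values())
if balance > 0:
    verdict = "pro"
elif balance < 0:
    verdict = "con"
else:
    verdict = "too close to call"
print(f"balance = {balance:+d} -> {verdict}")  # -> balance = +1 -> pro
```

As the text notes, such a rough tally can separate clearly unequal alternatives but lacks the sensitivity to resolve closely balanced ones.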

Decision Making in the Context of Patient Care

Fast and frugal

If the context of the clinical scenario is one of urgency, the clinician must make a quick decision on managing the patient's care that satisfies the fast and frugal heuristic criteria of relief of suffering (ie, pain management), restoration of function and appearance, and reasonable accessibility to care (ie, affordability). Take, for example, a 35-year-old woman named Ruth who presents to your dental chair with a broken front tooth, the result of a blow to her mouth while playing baseball. The trauma is significant: only the cervical third of the crown remains, and there is visible bleeding from the exposed pulp chamber. Radiographic examination confirms that the root is not fractured and its periodontium is intact. The context in this case is one of urgency in which you, as an experienced clinician, need to use a fast and frugal heuristic to decide on a treatment that addresses three compelling issues: (1) alleviating the pain, (2) restoring dental function and appearance (to maintain the patient's dignity), and (3) cost (what the patient is willing to pay for dental treatment).

1. Alleviate the pain. This is quickly addressed by administering local anesthetic and prescribing analgesics, with the instruction to the patient to take the analgesic on an as-needed basis.
2. Restore function and appearance. Two same-day procedures are available to the dentist: (1) save the tooth by performing an emergency pulpectomy and temporarily restoring function and esthetics with a temporary composite filling, with the future prospect of restoring the tooth with root canal therapy, post and core, and crown; or (2) extract the tooth and deliver a flipper denture the same day.
3. Affordability. This criterion must be applied in consideration of the economic status of the patient. A wealthy Ruth has the privilege of choosing between saving and extracting the tooth; her decision will depend on the economic value that she places on keeping her natural front tooth. However, you learn that Ruth is a single mother working at minimum wage in a job that includes neither dental benefits nor paid sick days, and she believes she has no prospect of changing jobs in the next couple of years. Ruth feels the immediate and future cost of saving the tooth through the complex, multiple-appointment procedure of root canal therapy, post and core, and crown is not economically feasible.

The shared decision made between Ruth and you, the dentist, is to remove the tooth and deliver a removable acrylic flipper denture later that day. This solution may not be optimal because of possible future problems caused by the missing tooth and the possible lack of stability of the removable prosthesis. The decision to remove the tooth is made in the context of Ruth's current circumstances (ecological rationality); a different decision might have been made if there were neither time nor cost considerations.

About a year later, Ruth returns to your office complaining that the denture you made is "loose, bulky and rough." On examining the denture, you notice that there is a crack through the center of the denture base.

Furthermore, Ruth tells you about her new job, which has doubled her salary. Moreover, she has a comprehensive employee benefit package that includes dental insurance. She is now ready to consider other options. Aside from the cost, Ruth also wants to be sure that the tooth replacement option chosen will last for at least 5 years. The clinical context has now changed significantly from a year ago, from one of urgency to one of elective care. Fortunately, dentists are often confronted by scenarios that need not be solved immediately; rather, they have time to reflect on the available options with all their associated advantages and disadvantages. After completing a thorough examination, you find that Ruth has generally good oral hygiene with no signs of active oral disease. Two unrestored teeth with healthy periodontium border the edentulous space. A cone beam computed tomography scan confirms adequate bone volume to support a dental implant. Hence, you conclude she is now a candidate for complex prosthetic treatment. You share this positive finding with Ruth and review the following four options to manage the edentulous space:

1. Repair her current acrylic removable denture.
2. Fabricate a new cast removable partial denture (RPD).
3. Employ a conventional fixed partial denture (FPD).
4. Insert a dental implant followed by a crown.

Ruth understands that she can continue to wear her slightly broken denture while she ponders this list of options of varying durability, comfort, and cost. In quantitative decision making, what action Ruth and you decide to pursue will depend on (1) how you, as a knowledgeable professional, quantify the likelihood of success and the risk of harm for each option and (2) how Ruth assesses the tradeoffs and utility of each option to her. Hence, the treatment plan decision involves Ruth and you interpreting the clinical evidence against a backdrop of your understanding of the current psychosocial and biologic knowledge bases and her values and preferences. The information includes evidence found in clinical research articles, conferences, workshops, study clubs, continuing education courses, and articles by opinion leaders and peers, as well as your own clinical experience and biases. Because some pieces of evidence may conflict, the decision on Ruth's care is made in a world of imperfect information, uncertainty, and choice.


Despite the inexactness of clinical practice, you must move forward by making a definite clinical decision with the objective of meeting Ruth's needs and expectations (ie, clinical success). Ideally, a clinician's decision is defensible on universally accepted rational principles. Evidence-based decision making in health care is becoming one such widely accepted approach.9–12 Evidence-based dentistry (EBD) is defined by the American Dental Association as "an approach to oral health care that requires the judicious integration of systematic assessments of the best clinically relevant scientific evidence, relating to the patient's oral and medical condition and history, with the dentist's clinical expertise and the patient's treatment needs and preferences."13 Thus, EBD is largely based on three guiding principles: (1) the dentist's clinical experience, (2) the best current scientific evidence, and (3) the patient's needs and preferences. The key message is that dentists, along with their patients, should consider all three aspects of EBD when arriving at an oral health care decision. Any decision that ignores one of the domains risks an undesirable clinical outcome and failure to deliver high-quality health care. In other words, the three domains of EBD are each quantitatively important to clinical decision making.13,14

Quantitative decision tree analysis (DTA) is a systematic mathematical approach to decision making that uses the principles of EBD in an attempt to make defensible clinical decisions in a world of imperfect information, uncertainty, and choice. DTA has its roots in business and economics.15 Its application to medicine began in the 1960s and then evolved through the 1970s and 1980s.16–19 In 1997, a series of seminal articles introducing the application of DTA to patient care was published in the journal Medical Decision Making.20–24 This discipline of medical decision making continues to evolve with the objective of delivering safe, effective, and efficient health care to patients and society.25 It is based on the axioms of expected utility theory, which generally assume that rational people make decisions in their own best self-interest.15 In other words, given the time to thoroughly review the details of all competing options, people will select the option that is most likely to maximize their perceived well-being (utility) and/or minimize their perceived harm to themselves (disutility). Achieving this goal in a world of uncertainty and choice requires quantitatively weighing the probability of occurrence and the desirability of the outcomes associated with each option.22,26–28

The systematic process of quantitative decision tree analysis

Coming to a complex clinical decision that an oral health care provider can defend involves a systematic process. In brief, there are seven steps in the analysis:

1. The precise definition of the clinical problem
2. The identification of possible treatment strategies
3. The construction of a "decision tree" that graphically depicts the decision process
4. The calculation of the probabilities for success and failure in each branch of the decision tree
5. The assessment of the utility (ie, perceived value to the patient) of each outcome
6. Determining the overall utility of each treatment option
7. Selecting the treatment option that maximizes the patient's expected utility

This approach to decision making attempts to incorporate patient preferences in the decision together with the probability of success to the patient. Another decision-influencing factor is the cost of treatment and the issue of cost-benefit and cost-effectiveness analysis. Resolving this issue requires estimating how much the patient is willing to pay for improved quality of life as assessed by the patient, and comparing the cost per unit of improved quality of life among the different treatment options. Overall, the DTA approach minimizes adverse surprises to the patient, as the patient knows in advance the probability of the treatment failing, the expected benefits and relative costs, and the alignment of the treatment with their personal values prior to committing to treatment. The dentist can be assured of offering sufficient information to enable the patient to grant informed consent to treatment; an important benefit in these times of professional accountability. The steps will now be considered in more detail.

Step 1: Defining the clinical problem

Evidence-based decision making starts with formulating the clinical problem into an answerable question. This begins with defining the patient's problem, followed by listing all feasible interventions, including no intervention, to clinically manage it. Finally, the important consequences of each option are identified and defined by a number of possible outcomes. Ruth must decide between simply repairing her current acrylic removable denture and upgrading to a dental implant, cast RPD, or FPD. In other words, you and Ruth are seeking the answer to the following PICO (see chapter 5) formatted clinical question: Which of the following three options—(1) new cast RPD, (2) conventional three-unit FPD, (3) dental implant (interventions), compared to just repairing the current acrylic denture (status quo comparison)—will make Ruth (patient) better off after 5 years (outcome)?

Step 2: Decision balance sheet

Once the problem has been defined, a decision balance sheet, not unlike Benjamin Franklin's moral algebra of pro and con, is completed. This involves listing all important treatment options with their associated advantages and disadvantages.29 Faced with an edentulous space, you wonder whether Ruth will be better off in 5 years with a new prosthetic appliance and, if so, which treatment would be in her own best interest. For simplicity, only the important benefits and drawbacks are mentioned (Table 23-1). The simplest and quickest solution is to repair Ruth's acrylic denture. However, the alternatives may offer her the durability and comfort she is seeking over a 5-year period. Although both the FPD and the implant offer the benefit of a fixed functional tooth replacement, the former option immediately rids Ruth of the removable appliance, but at the cost of possible harm in preparing the abutment teeth, which may subsequently require endodontic treatment.

Table 23-1 | Balance sheet

1. Cast RPD
Advantages: Least costly of all new options; minimally invasive; treatment completed in 2 weeks; more stable than an acrylic partial denture
Disadvantages: Removable; risk of damaging adjacent teeth29; often needs adjustments

2. FPD
Advantages: Fixed appliance
Disadvantages: Invasive procedure; costly; need to wear provisional prosthesis for 3 weeks; flossing under prosthesis difficult; risk of endodontic therapy on adjacent teeth

3. Implant
Advantages: Fixed appliance; almost like a natural tooth; easy to floss; does not compromise adjacent teeth
Disadvantages: Undergoing costly invasive dental treatment; tooth replacement will typically be functional only after 4 months; need to wear removable denture for 4 months

4. Repair
Advantages: Quick; least costly
Disadvantages: Removable appliance; risk of damaging adjacent teeth29; often needs adjustments and repairs

Ruth's options for managing her missing front tooth include a cast RPD, a conventional three-unit FPD, a single crown supported by a dental implant, or repairing her current removable acrylic partial denture.

Step 3: Tree construction

The purpose of a decision tree is to graphically depict the clinical decision through a series of available options with their respective possible outcomes. Figure 23-1 is a schematic of the outcomes associated with each treatment option. A well-defined clinical problem can be depicted as a tree with branches representing the series of potential resolutions available. Branching from each option is a series of mutually exclusive outcomes that can possibly occur. This decision tree attempts to dissect a problem into clear and easily understood components. The square box signifies the decision node; it is at this point in the decision tree that the decision maker is asked to make a choice. What follow are the chance nodes, indicated by circles. Branching from these chance nodes are the possible consequences, represented by the probabilities of positive and negative outcomes. Since all the branches off a chance node are mutually exclusive, the sum of their probabilities must equal one (ie, 100%). The end node, depicted by a triangle, indicates the decision maker's quantitative value judgment of the outcome from the branch that led to it. At this point, all uncertainties and utilities associated with each decision have been realized. In other words, the decision maker understands that once they commit to a decision, they accept the associated uncertainty and utility of each outcome. This series of probabilities leading to each outcome, and the outcomes themselves, can then be analyzed individually and compared collectively with the intention of arriving at the optimal decision in a world of uncertainty and choice.

Fig 23-1 Decision tree for the management of a missing maxillary central incisor. The decision node (blue box) branches over the treatment options; chance nodes (green circles) mark points where there is a possibility of success or failure; and the final outcome along each branch is indicated with a red triangle. The probabilities for each outcome are given at each chance node, and the tradeoffs and utilities at each outcome triangle. In outline, the tree is:
- FPD: vital (p[vital]) → functional FPD; not vital (1 − p[vital]) → endo success (p[endo]) → functional FPD, or endo failure (1 − p[endo]) → missing tooth
- Implant: success (p[implant]) → functional implant; failure (1 − p[implant]) → missing tooth
- Cast RPD: success (p[cast RPD]) → functional cast RPD; failure (1 − p[cast RPD]) → missing tooth
- Repair: success (p[repair]) → functional acrylic RPD; failure (1 − p[repair]) → missing tooth
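The tree in Fig 23-1 can also be written down as a small data structure. Below is a minimal sketch in Python; the nested representation and the helper name check_probabilities are illustrative choices rather than anything prescribed by the chapter, and the numeric probabilities are the ones assigned later in Table 23-2.

# A chance node is a list of (probability, child) branches; a leaf is an
# outcome label. Branches off a chance node are mutually exclusive, so
# their probabilities must sum to 1.
decision_tree = {
    "FPD": [
        (0.95, "Functional FPD"),        # abutment tooth remains vital
        (0.05, [                         # abutment loses vitality
            (0.80, "Functional FPD"),    # root canal therapy succeeds
            (0.20, "Missing tooth"),     # root canal therapy fails
        ]),
    ],
    "Implant": [
        (0.97, "Functional implant"),
        (0.03, "Missing tooth"),
    ],
    "Cast RPD": [
        (0.75, "Functional cast RPD"),
        (0.25, "Missing tooth"),
    ],
    "Repair": [
        (0.40, "Functional acrylic RPD"),
        (0.60, "Missing tooth"),
    ],
}

def check_probabilities(node):
    # Verify that every chance node's branch probabilities sum to 1.
    if isinstance(node, str):            # leaf: a final outcome
        return
    assert abs(sum(p for p, _ in node) - 1.0) < 1e-9
    for _, child in node:
        check_probabilities(child)

for chance_node in decision_tree.values():
    check_probabilities(chance_node)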

Step 4: Assigning uncertainty: the probability of each branch

The historically influential clinician Sir William Osler (1849–1919) commented that "[Clinical practice] is a science of uncertainty and an art of probability."1 Probability is a measure of uncertainty, anchored between the numbers 0 (zero) and 1 (one). These anchor points indicate with absolute certainty that an event will not occur (ie, p[lower anchor] = 0) or will occur (ie, p[upper anchor] = 1). The conventional notation of probability is given as p[A], where [A] refers to the event. In the decision tree (see Fig 23-1), the probability of an abutment tooth remaining vital is given as p[vital]. Since the two branches from the chance node are mutually exclusive, the probability that an abutment tooth will require a root canal treatment is given by 1 − p[vital]. The same holds true for the success of root canal therapy (p[endo]), a dental implant (p[implant]), a cast RPD (p[cast RPD]), and repairing Ruth's current acrylic denture (p[repair]).

Table 23-2 | Probabilities of treatment events for the management of a missing maxillary central incisor

Event | Notation | Probability (5-year survival) | Reference
Tooth remains vital after abutment crown preparation | p[vital] | 0.95 | Whitworth et al33 (2002)
Success of root canal | p[endo] | 0.80 | Setzer and Kim30 (2014)
Success of implant | p[implant] | 0.97 | Setzer and Kim30 (2014)
Success of cast RPD | p[cast RPD] | 0.75 | Rehmann et al34 (2013)
Success of acrylic RPD | p[repair] | 0.40 | Vermeulen et al35 (1996)

Probabilities are estimated from clinical experience, the patient's attributes, and the literature.

Clinical outcomes are rarely absolute but come with some degree of uncertainty that is reflected in the clinician's opinion of their likelihood. The factors that play into this opinion can come from objective or subjective sources.20,26 An objective probability infers that a true measure of an event's likelihood exists in a specific population. Repeated measuring of the occurrence of each specific clinical scenario can estimate this specific population uncertainty. The proponents of this school of thought are referred to as frequentists. However, relying only on objective probabilities to make clinical decisions is often impractical in the real world because no such large database of frequentist data is available for every clinical scenario. Hence, clinicians often consider their personal clinical experience, the attributes of the individual patient, as well as reliable population-based probability data available in the professional literature to estimate a subjective probability of the outcome. A comprehensive discussion of how to integrate clinical experience, patients' attributes, and published probabilities into a single number to express the clinical uncertainty of patient outcomes is found in Sox, Higgins, and Owens's Medical Decision Making.26 Often, experienced clinicians reflect on past events to directly assess the probability of future outcomes. For example, a dentist may not have had an implant failure in the last 5 years, during which she inserted 100 implants, but knowing that no one is perfect, she realizes that there is a possibility of some of them failing. Using the table in appendix 2 and her experience of 0 failures in 100 implants, she estimates that the 95% confidence interval for failure ranges from 0 to 3.62%, and hence the 5-year durability of an implant placed in her practice might be as low as 96.38%.
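That zero-failure interval can be reproduced directly. The short Python sketch below assumes an exact (Clopper-Pearson) interval, for which, when no failures are observed, the lower bound is 0 and the 95% upper bound solves (1 − p)^n = 0.025:

n = 100          # implants placed, with 0 failures observed
alpha = 0.05     # for a 95% confidence interval

# With zero events, the Clopper-Pearson upper bound has a closed form.
upper_failure = 1 - (alpha / 2) ** (1 / n)

print(f"95% CI for the failure rate: 0 to {upper_failure:.2%}")       # ~3.62%
print(f"5-year survival could be as low as {1 - upper_failure:.2%}")  # ~96.38%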

Another approach to estimating the probability of an outcome is to assess it indirectly against a reference point. Imagine playing the following hypothetical betting game, in which you have to select between two options; the options are sequentially changed so that at some point they are equally attractive to you. There are two starting points:

Option 1 (reference option): You are confident that the implant will survive 5 years.
Option 2 (alternative option): Play a lottery game with a 99% chance to win $4,000 (the cost of the implant).

If you choose to play the lottery, then it is inferred that you believe that the 5-year success rate of an implant is less than 99%. In the next step, the alternative option is changed from 99% to 95%, and you determine whether you still prefer the lottery. If so, it implies that you continue to have more confidence in a 95% chance of winning $4,000 than in the implant lasting 5 years. You continue with this iterative thought exercise until you feel the options are equally likely; that is, you are confident that an implant lasting 5 years is as likely as, for example, a 90% chance of winning $4,000. This point of indifference between the two options indirectly estimates your assessment of the probability that the implant will survive 5 years. The direct and indirect approaches to assessing clinicians' subjective probabilities are prone to human errors (biases) that are hard to control. For example, how similar is Ruth to the other patients you treated in the last 5 years (representativeness bias), or is the streak of success in the last 5 years causing you to forget the string of failures you had 6 years ago (availability bias)?
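The indifference search itself is just a bisection. A hypothetical sketch, in which prefers_lottery stands in for the clinician's answer at each step (it is not a function from the chapter):

def elicit_probability(prefers_lottery, low=0.0, high=1.0, tol=0.01):
    # Narrow the interval until the lottery's win probability is within
    # `tol` of the point of indifference; that point estimates the
    # clinician's subjective probability that the implant survives 5 years.
    while high - low > tol:
        p = (low + high) / 2
        if prefers_lottery(p):
            high = p   # implant judged less likely than p; offer a smaller p
        else:
            low = p    # implant judged at least as likely as p; raise p
    return (low + high) / 2

# Example: a clinician whose unstated belief is 0.90 answers accordingly.
clinician = lambda p: p > 0.90
print(round(elicit_probability(clinician), 2))   # converges near 0.90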

Table 23-3 | Outcomes in dentistry

Biologic status
- Physiologic: Salivary flow and consistency, demineralization, and inflammation
- Microbiologic: Oral microflora composition, presence of specific pathogens
- Sensory: Presence of pain, paresthesia

Clinical status
- Survival: Longevity/loss of tooth, pulp, tooth surface, restoration
- Mechanical: Smoothness of margins, surfaces, and contours
- Diagnostic: Presence of pathology, caries, periodontal disease
- Functional: Ability to chew, speak, swallow

Psychosocial status
- Satisfaction: Satisfaction with treatment, dentist, oral health provider
- Perception: Esthetics, oral health self-rating
- Preference: Values for health states and health events
- Oral health quality of life: Rating of how oral health affects life

Economic cost
- Direct: Out-of-pocket payments, third-party payments
- Indirect: Lost wages, transportation, child care expenses

Based on Bader and Ismail.37

Although probability estimated from clinical experience is a key part of clinical practice, referring to published probability estimates and considering the attributes of the individual patient are also important for complementing and adjusting subjective estimates to more accurately represent the actual situation. For example, dental implants are commonly reported to be 95% successful after 5 years in function.30 However, if Ruth smokes, you may adjust the implant failure rate to be two- to threefold higher.31,32 Furthermore, the risk of endodontic therapy as a result of preparing an abutment crown on a nonrestored tooth is likely to be lower than the reported probability in the literature because the data in that study was taken from a random cohort of patients with both nonrestored and restored teeth.33 Returning to the clinical example, Table 23-2 presents estimates of probability of success from the dental scientific literature that could be the basis of DTA calculations in Ruth’s case.30,33–35
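As a small worked example of such an adjustment, the sketch below inflates the published failure rate by the two- to threefold multipliers from the smoking meta-analyses cited above; the arithmetic, not the multipliers, is the illustration here:

p_success_published = 0.97            # 5-year implant success (Table 23-2)
p_failure = 1 - p_success_published   # 0.03

for multiplier in (2, 3):             # two- to threefold higher failure risk
    adjusted_success = 1 - multiplier * p_failure
    print(f"{multiplier}x failure risk -> success = {adjusted_success:.2f}")
# prints 0.94 and 0.91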

Step 5: Assigning value judgments/preferences (utility) to the outcomes of each option

The decision tree's end points (triangles) represent outcomes. In the PICO-defined question, the outcome was inferred by the vague proposition of "better off in 5 years." Ideally, the outcome should be clearly defined, and for the sake of DTA, the desired outcome of interest must be measurable. Outcomes are the results or consequences of an action.36 For example, the action of restoring a missing maxillary central incisor with an FPD can lead to three possible outcomes: a functional FPD, a functional FPD with subsequent endodontic therapy, or—in the worst-case scenario—an even larger edentulous space if the FPD fails to survive 5 years. In their seminal paper, Bader and Ismail37 describe four dimensions of oral health outcomes: (1) biologic status, (2) clinical status, (3) psychosocial status, and (4) economic cost. Each dimension is further subdivided into a series of categories (Table 23-3).

Patient-centered outcomes

Historically, dentists were trained to assess the quality of the biologic or clinical status of oral health care outcomes without seriously considering the psychosocial and/or economic consequences to the individual patient. Moreover, there is suggestive evidence that health care providers are generally poor at evaluating what is important to the patient.38 The focus on patient-centered outcomes is growing and has become an active field in health care research.39–42 Patient-centered outcomes are outcomes that directly speak to the personal experience and concerns of the patient; they usually concern the psychosocial and economic dimensions. Patient-centered outcomes consider the amount of benefit perceived by the patient for the clinical outcome balanced against the nuisances associated with treatment, the risk of harm, and cost.
For example, Ruth will incur a significant financial cost if she chooses to replace a missing tooth with a new prosthesis. Added to this cost are the nuisance of spending time in the dental office, the likelihood of postoperative pain, and the risk of requiring root canal therapy as a result of abutment crown preparation. Intuitively, it is assumed that Ruth will select the treatment she perceives will make her sufficiently better off in the long term to compensate for the short-term sacrifices of her time, cost, and nuisance.

Health-state utility

The decision on selecting the best option will depend highly on how much Ruth values the outcomes associated with each end point in the decision tree. Ruth's value of, and preference for, a specific state of health is referred to in economics as a utility.22,25,27,43 A health-state utility (HSU) is a numeric indicator (often scaled between 0 and 1, although scales of 0–10 or 0–100 are also used) of the desirability to the patient of, or the patient's preference for, a given health state, where 0 represents the worst-case health state and, depending on the scale, 1, 10, or 100 represents the best-case health state. Utility is an abstract term that essentially reflects the value the patient places on a non-ideal health status (removable prosthesis, implant, FPD, repaired denture) compared to the worst-case scenario (missing tooth) and the ideal health status (nonrestored tooth). Over the last half century, much work has gone into developing reliable and valid methods of quantifying patients' HSUs.15,22,26,27,43,44 Comprehensive discussions of utility values and how they are used in health care economic analysis are found in Sox et al,26 Hunink et al,27 and Drummond et al.45 The three most common approaches to measuring patient HSU (ie, preferences) are the standard gamble, the time trade-off, and ranking preference with the use of a visual analog scale (VAS).

1. Standard gamble. This is considered the most valid measure of utility because it measures individuals' preferences in the context of uncertainty and possible undesirable consequences (ie, a gamble).9,16,26,27,43–45 Although beloved by statisticians, the notion of working through probabilities of immediate outcomes is typically confusing to patients.16,26,27
2. Time trade-off. This approach asks the patient to choose between a shortened life in good health and a longer life in poor health, as can occur in the treatment of cancer, where a treatment such as chemotherapy can have very undesirable consequences even though it extends life span.26,27,44,45 It does not lend itself to Ruth's case, since the dental scenario is not a choice between life and death.
3. Visual analog scale. This psychometric approach entails having the patient indicate his or her preference by placing a mark somewhere on a linear scale anchored between the worst-case scenario (0) on one end and the best-case scenario (1) on the other.26,27,44,45 It is often seen in the dental literature because this approach is intuitive and thus easier to apply to patient decision making.44–56 Purists might argue that a VAS does not generate a true HSU because it is not a "choice-based" technique that demands a sacrifice or risk of the decision maker when evaluating their preferences.44 However, several transformation scales have been proposed to convert VAS values into truer HSUs.27,57 Nevertheless, the VAS is commonly used, still brings aspects of the patient's preferences into the analysis, and hence appears to be reasonably valid when applied to a DTA.25,27,58,59
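The transformation scales mentioned above can take several forms. The sketch below uses a power curve, one family discussed in the utility literature; the exponent 1.6 is purely illustrative and is not a value taken from the chapter or its references:

def vas_to_utility(v, a=1.6):
    # Hypothetical power transformation: stretches VAS ratings upward,
    # reflecting that choice-based utilities tend to exceed raw VAS marks.
    return 1 - (1 - v) ** a

for vas in (0.3, 0.5, 0.8, 0.9):   # Ruth's VAS ratings (see Fig 23-2)
    print(f"VAS {vas:.1f} -> transformed utility {vas_to_utility(vas):.2f}")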

Table 23-4 | Ruth's outcome utilities and costs for the management of a missing maxillary central incisor

Outcome | Utility | QAPY (years) | Cost ($)*
Functional FPD, no root canal | 0.80 | 4.00 | 3,200
Functional FPD, with root canal | 0.80 | 4.00 | 4,000
Functional implant | 0.90 | 4.50 | 4,000
Functional cast RPD | 0.50 | 2.50 | 1,500
Functional acrylic RPD | 0.30 | 1.50 | 300
Missing tooth | 0.00 | 0.00 | 0

*Costs and patient utilities are hypothetical values.
QAPY, quality-adjusted prosthetic years based on a 5-year life of the prosthesis.

The VAS approach is simple to employ chairside in a conversation between the patient and an oral health care professional. Hence, you elect to assess Ruth's preferences by ranking the value of each outcome on a VAS between anchor points of the best-case scenario (1 = living with a perfectly healthy front tooth) and the worst-case scenario (0 = living with a missing front tooth). Applying the approach to Ruth involves asking her to indicate, on a calibrated straight line anchored between the worst- and best-case scenarios, her preference for each specific outcome (Fig 23-2). Using a VAS, Ruth assessed the utilities for the possible outcomes of treating a missing front tooth as 0.9 for the dental implant, 0.8 for the FPD, 0.5 for the cast RPD, and 0.3 for continuing with her repaired current acrylic flipper denture (Table 23-4). Intuitively, these utility measurements reflect Ruth's preferences relative to the ideal state of perfectly healthy teeth. For example, Ruth perceives that a dental implant, FPD, new cast RPD, and repaired denture offer her 90%, 80%, 50%, and 30%, respectively, of the satisfaction of a perfectly healthy front tooth. These numbers represent only Ruth's perceived preference for each state, without considering the number of years she will be living in that state of health.

Fig 23-2 Visual analog scale. On a VAS, Ruth (the hypothetical patient) values the repair of her denture, a new cast RPD, an FPD, and an implant as 0.3, 0.5, 0.8, and 0.9, respectively.

Quality-adjusted years

Quality-adjusted years is a construct used in health economics to account not only for the perceived value of a state of health but also for the time a patient is asked to live with this perceived state of health. Quality-adjusted life years (QALYs) are often used when the time frame is an estimate of the patient's remaining years of life. In Ruth's case, we are considering only the life of the prosthesis and hence refer to quality-adjusted prosthetic years (QAPY).28,44,50 QAPY considers the preference that each prosthesis offers Ruth over the time frame considered, which in this case is 5 years. QAPY is calculated as the product of the utility and the time frame. For a perfectly healthy tooth, the QAPY would be 5 (ie, 1 × 5 years); hence, Ruth's QAPYs for a repaired denture, new cast RPD, FPD, and dental implant are 1.5 (ie, 0.3 × 5), 2.5 (ie, 0.5 × 5), 4 (ie, 0.8 × 5), and 4.5 (ie, 0.9 × 5), respectively. Intuitively, these numbers mean that over the course of 5 years, Ruth estimates the repaired denture, new cast RPD, FPD, and implant to offer her, respectively, 1.5, 2.5, 4, and 4.5 years of a quality of life equivalent to a perfectly healthy tooth, whose value over 5 years would be 5 (see Table 23-4). Quality-year adjustments used in health economics thus represent how the patient values a specific health state over the time of interest (eg, the course of the patient's life or the life of the dental prosthesis).

Step 6: Determining which option maximizes utility (folding back the tree)

According to the axioms of utility theory, the best decision Ruth can make will depend on the quantitative analysis of the assigned probability and her utility for each outcome.15,26,27 These numbers can be applied to the decision tree (Fig 23-3).

Fig 23-3 Ruth's decision tree for the management of her missing maxillary central incisor.

Expected utility value

The quantitative interpretation of a decision tree is determined by the expected yield of each decision. With the assumption that Ruth is a rational person, she will select the option that will optimize her utility. The expected yield of each option depends on the probability of each outcome's occurrence and the utility Ruth assigns to it. The expected yield of each treatment option once the probability and utility are considered is referred to as the expected utility value (EUV). This is calculated by "folding back the tree" to obtain the final EUV of each decision, which is simply the calculated value of the product of all probabilities and utilities associated with each branch of the decision node.17,23,26,27 That is, each utility is weighted by its probability. For example, in the instance of the dental implant, there is a 97% chance of success multiplied by a utility of 4.5 QAPY to yield an expected utility of 4.37 QAPY. The other branch of the dental implant option (implant failure) contributes 0 (ie, the 3% probability of failure times the QAPY of a missing tooth, which is 0) to the expected utility value. Therefore, the total for the dental implant choice is 4.37 QAPY.

In theory, the QAPY can take on a negative value. For example, implant failure can result in an outcome worse than merely losing a tooth: a fractured implant in situ would often require surgery for its removal, with the associated monetary and personal costs, such as pain. Such negative utilities require modifications of the decision tree calculation, which are described elsewhere.60–62 Although there is a possibility that Ruth could experience such a situation and suffer a net disutility, the probability of such an outcome is reported to be as low as 0.16%.63 Hence, this disutility was not considered in Ruth's DTA.

Simply stated, the EUV means that if the decision maker were to carry out many such decisions with the same probabilities of success and failure and the same utilities, then the average utility to the decision maker would be given by the EUV. DTA is based on the grounds that "reasonable" decision makers, accepting that they live in a world of uncertainty, seek to make an a priori choice (ie, gamble) that maximizes their EUV. Folding back this tree favors the decision to replace the missing central incisor with a dental implant over the other three options: it provides the highest EUV, 4.37 QAPY. In other words, the implant offers Ruth 0.41 more QAPY (4.37 − 3.96, equivalent to about 5 months) than the FPD (Fig 23-4). Furthermore, the implant offers more than double the QAPY of the cast RPD and more than 7 times the QAPY of repairing her current acrylic partial denture.

Fig 23-4 Ruth's DTA for the management of her missing maxillary right central incisor. Each EUV is calculated as a weighted average of the probabilities and Ruth's utilities for the outcomes of that option:

FPD: (0.95 × 4) + 0.05 × [(0.80 × 4) + (0.20 × 0)] = 3.96 QAPY
Implant: (0.97 × 4.5) + (0.03 × 0) = 4.37 QAPY
Cast RPD: (0.75 × 2.5) + (0.25 × 0) = 1.88 QAPY
Repair: (0.40 × 1.5) + (0.60 × 0) = 0.60 QAPY

Step 7: Sensitivity analysis

The validity of the decision arrived at from the DTA depends on the accuracy of the data used. Given that clinical decisions are made in a world of imperfect information, the probabilities and utilities assigned in a DTA are subject to error. A sensitivity analysis assesses how robust the tree's conclusions are under conditions of error. The validity of the assumptions made in the branches of the DTA is typically tested using a sensitivity analysis in which the probabilities are varied over a small range to determine the sensitivity of the decision to the estimates of probability. A robust decision is one that is relatively insensitive to the values assumed.

In assigning probabilities to Ruth's decision, we had to rely on many factors (clinical experience, published data, the patient's individual features), all of which are prone to error. For example, there is a possibility that the success of dental implants was overestimated. The question arises of how changes in the assigned 5-year probability of implant success would affect Ruth's decision. Assessing the effect of changes in only one variable is referred to as a one-way sensitivity analysis. A one-way sensitivity analysis was conducted by varying the value assigned to the probability of dental implant success, where "success" is defined as the dental prosthesis still being in service after 5 years (Fig 23-5). The intersection between the implant option and another option is referred to as the threshold point; it is the point where the EUV changes in favor of one treatment over the other. The EUVs of the repaired denture, cast RPD, and FPD are about 0.60, 1.88, and 3.96, respectively. Probabilities of implant success that yield EUVs greater than these numbers favor the implant over the specific treatment. For example, if the actual success rate of an implant is 80%, then the calculated EUV becomes 3.6, and the preferred treatment option becomes the FPD. The figure also shows the threshold probabilities for the implant being the preferred option relative to each of the other treatments. A similar sensitivity analysis could be performed by varying the probability of other variables in the decision tree, such as the probability of the FPD abutment needing root canal therapy, or by varying Ruth's utility for a cast RPD.

Fig 23-5 One-way sensitivity analysis varying the probability of dental implants' 5-year success. The threshold probabilities between the implant and the repair, cast RPD, and FPD are 0.14, 0.42, and 0.85, respectively.
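The threshold points can also be recovered analytically: the implant's EUV is the line p[implant] × 4.5, so each threshold is where that line crosses a competitor's fixed EUV. A minimal sketch (the exact crossings may differ slightly from the rounded values marked in the figure):

implant_qapy = 4.5   # utility of a functional implant
competitor_euv = {"Repair": 0.60, "Cast RPD": 1.88, "FPD": 3.96}

for name, euv in competitor_euv.items():
    threshold = euv / implant_qapy   # solve p * 4.5 = competitor's EUV
    print(f"Implant preferred over {name} when p[implant] > {threshold:.2f}")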

Economic analysis: Cost-effectiveness analysis and willingness-to-pay analysis

Utility analysis considers the patient's preference for a specific health status. This generally includes Ruth's perceived value judgment of a specific health status (ie, replacing the front tooth with an FPD, implant, cast RPD, or repair) weighed against the actual and potential consequences (ie, nuisance and harm) of the various treatments. It did not consider the cost she would incur to replace a missing tooth with a repaired or new prosthesis. Although the implant offers Ruth the maximal EUV, it also comes at significant cost compared to its alternatives. Hence, Ruth is interested in the comparative cost-effectiveness of a new prosthesis versus just repairing what she already has.

A cost-effectiveness analysis compares the impact and cost of different treatment options.26,27,44,45 The relationship between treatment cost and effectiveness is represented as a cost-effectiveness plane (Fig 23-6).64 The upper left quadrant (first quadrant) of the cost-effectiveness plane refers to treatments that cost more but are less effective than the standard treatment (repaired denture); any treatment that lies in this quadrant would not even be considered. Treatments falling in the lower right quadrant (third quadrant) are highly desirable because these treatments are more effective and less costly than the standard treatment. The lower left quadrant (fourth quadrant) contains treatments of questionable value that are nevertheless not excluded, because a slightly less effective treatment at a significant cost saving may be justified. For example, offering Ruth a thin clear mouthguard carrying a replacement tooth, although not as effective as repairing her current denture, is less costly and may possibly be more cost-effective. However, the most typical scenario is analyzing the economics of treatments falling in the second quadrant (upper right side), where the treatment is more effective but also costlier than the standard control treatment. In this clinical example, effectiveness based on Ruth's preference is measured in units of QAPY, and cost refers only to Ruth's out-of-pocket cost for each treatment option (Table 23-5).

Fig 23-6 Cost-effectiveness plane. The difference (∆) in effectiveness (∆Effectiveness) and cost (∆Cost) between a comparable treatment and a control treatment are analyzed. Cost-effectiveness ratios lying in the first quadrant indicate that the proposed treatment is less effective but costlier than the standard treatment; hence, the proposed treatment is rejected. On the other hand, when cost-effectiveness ratios lie in the third quadrant, the proposed treatment is immediately accepted because it is always favorable (ie, the proposed treatment is more effective and at lower cost than the standard). Ratios located in the second and fourth quadrants indicate that the proposed treatment may be considered if its cost-effectiveness ratio is favorable over the standard treatment. (Reprinted from Abrahamyan et al64 with permission.)

Table 23-5 | Incremental cost-effectiveness ratio

Options | Expected cost ($) | EUV (QAPY) | Incremental cost (∆cost) ($) | Incremental utility (∆QAPY gained) | ICER ($/QAPY gained)
Repair | 300 | 0.60 | - | - | -
Implant | 4,200 | 4.37 | 3,900 | 3.77 | 1,034
FPD | 3,535 | 3.96 | 3,235 | 3.36 | 963
Cast RPD | 1,750 | 1.88 | 1,450 | 1.28 | 1,133

ICER, incremental cost-effectiveness ratio, ie, the difference in cost between an upgraded prosthesis and repairing the current prosthesis divided by the difference in their effectiveness.

Ruth's incremental cost-effectiveness ratio (ICER) between any specific proposed upgraded prosthetic treatment (referred to as the proposed treatment) and just repairing her current removable acrylic denture (referred to as the standard treatment) is given as follows:

ICER = (Cui − Cr) / (Eui − Er)

where:
Cui = cost of upgraded prosthetic option (i)
Cr = cost of the repaired prosthesis
Eui = effectiveness (utility) of upgraded prosthetic option (i), as per its EUV
Er = effectiveness (utility) of the repaired prosthesis, as per its EUV

For example, Ruth needs to pay $3,900 more for the dental implant than for the repaired denture, for a personal gain of 3.77 QAPY, which reflects an ICER of $1,034 per QAPY gained. Intuitively, this means Ruth must be willing to pay $1,034 to gain one QAPY with an implant over a repaired denture. The ICER for the FPD is slightly lower ($963), and the ICER for the new cast RPD ($1,133) is slightly higher, than the ICER for the dental implant. ICERs are helpful for comparing alternative treatment options but are not often used to make straightforward decisions. In this case, for example, the ICERs for all the upgrades to a new prosthesis are too similar to help Ruth make a confident decision. This situation results from the implant being more effective and costlier than the FPD, and the FPD being more effective and costlier than the cast RPD; their respective ICERs are too similar to definitively determine the best option without considering the value (preferably in monetary terms) that a gained QAPY for each upgraded prosthesis represents to Ruth. Therefore, further evaluating the cost-effectiveness of treatment options from Ruth's point of view requires knowing the monetary value that Ruth puts on a gained QAPY for each prosthetic upgrade. In other words, Ruth wants to make sure the cost per QAPY gained (ICER) is not greater than what she is willing to pay per QAPY gained. A rational decision requires that Ruth determine what a QAPY is worth to her. Therefore, further evaluation of the cost-effectiveness is needed to determine whether the differences in costs and monetary value to Ruth are large enough for any of the upgraded treatment options to represent greater value to Ruth than just repairing her current denture.

Establishing a monetary threshold for a unit of effectiveness can be complicated, especially when trying to establish a value at the society and community level.26,27,44,45 However, at the individual patient level, the willingness-to-pay (WTP) approach can be used to put a dollar value on the patient's perceived preference for a certain health outcome. WTP simply asks decision makers to determine how much money they are willing to forego to receive a desirable state of health. Although conceptually easy to understand, the logistics of carrying out a meaningful WTP analysis have been problematic and fraught with bias.65–69 Nevertheless, this method in principle lends itself well to evaluating dental treatment because there is often a direct exchange of money between the patient and the dentist.70–76 When a third party has responsibility for full or partial payment, the interactions become more complicated, as patients must calculate their actual cost (not the nominal cost of the treatment), which depends on the specific reimbursement policies of the third party. One way of measuring Ruth's WTP is to ask her to go through a thought experiment involving an iterative bidding algorithm to determine her monetary value for each treatment option.72,74 For example, the approach could start by offering an arbitrarily defined value, say, $5,000 for an implant. If this is higher than her WTP, then she is offered the same dental implant at $2,500. If she is comfortable paying this much, she is then asked whether she is willing to pay $3,750 for the implant. This back-and-forth bidding game continues until Ruth reaches a price of $4,500,
above which the price exceeds her WTP. Ruth undergoes a similar bidding approach for the FPD, cast RPD, and repaired denture, generating WTP values of $3,500, $1,500, and $300, respectively. With these WTP values, Ruth's willingness-to-pay threshold (WTPt) for a gained QAPY for each upgraded prosthesis can be calculated as follows (Table 23-6):

WTPt = (WTPui − WTPr) / (Eui − Er)

where:
WTPt = willingness-to-pay threshold
WTPui = willingness to pay for upgraded prosthetic option (i)
WTPr = willingness to pay for the repaired prosthesis (standard treatment)
Eui = effectiveness of upgraded prosthetic option (i)
Er = effectiveness of the repaired prosthesis

Table 23-6 | Willingness-to-pay threshold

Options | WTP ($) | ∆WTP (WTPui − WTPr) ($) | Incremental utility (∆QAPY gained) | WTPt ($/QAPY gained)
Repair | 300 | - | - | -
Implant | 4,500 | 4,200 | 3.77 | 1,167
FPD | 3,500 | 3,200 | 3.36 | 952
Cast RPD | 1,500 | 1,200 | 1.28 | 938

The WTPt is the monetary value (as determined in a thought experiment) that Ruth is willing to pay for an incremental gain in QAPY for the proposed treatment (ie, implant, FPD, or cast RPD) relative to the least costly standard treatment option (ie, repair).

The cost-effectiveness analysis for each "upgrade" treatment option is summarized in Table 23-7 and shown graphically in the cost-effectiveness planes in Fig 23-7.

Table 23-7 | Cost-effectiveness analysis

Options | WTPt ($/QAPY gained) | ICER ($/QAPY gained) | WTPt − ICER ($/QAPY gained) | Cost-effective
Repair | - | - | - | -
Implant | 1,167 | 1,034 | 133 | Yes
FPD | 952 | 963 | −11 | No
Cast RPD | 938 | 1,133 | −195 | No

The implant is the only cost-effective treatment option relative to the repair because its WTPt > ICER.

Fig 23-7 Cost-effectiveness planes for (a) the implant, (b) the FPD, and (c) the cast RPD relative to the repair. Although all treatment options fall in the second (potentially cost-effective) quadrant, only the implant option is considered cost-effective relative to the repair, because the implant is the only treatment whose WTPt line lies above its respective ICER line.

The analysis concludes that only the implant's WTPt is above its respective ICER, making the dental implant a more attractive and cost-effective option than repairing the current denture from Ruth's perspective. However, the WTP thresholds for the FPD and cast RPD options fall below their respective ICERs, leading Ruth to conclude that these options are not more cost-effective from her perspective than simply repairing the denture. This analysis applies to Ruth and is not transferable to other patients, because the utility and WTP values are unique to Ruth; they are not necessarily those of Susan, the next patient waiting in reception, who might have a similar decision to make but will likely have different utilities and WTP values for the three options.
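The whole cost-effectiveness comparison reduces to two quotients per option. A minimal sketch reproducing the logic of Tables 23-5 to 23-7 from the raw inputs (rounded results may differ by a few dollars from the printed tables):

# Per option: (expected cost $, EUV in QAPY, Ruth's elicited WTP $).
upgrades = {
    "Implant":  (4200, 4.37, 4500),
    "FPD":      (3535, 3.96, 3500),
    "Cast RPD": (1750, 1.88, 1500),
}
repair_cost, repair_euv, repair_wtp = 300, 0.60, 300  # standard treatment

for name, (cost, euv, wtp) in upgrades.items():
    gained_qapy = euv - repair_euv
    icer = (cost - repair_cost) / gained_qapy   # $ per QAPY gained
    wtp_t = (wtp - repair_wtp) / gained_qapy    # Ruth's threshold per QAPY
    verdict = "cost-effective" if wtp_t > icer else "not cost-effective"
    print(f"{name:8s} ICER ${icer:,.0f}  WTPt ${wtp_t:,.0f}  -> {verdict}")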

Limitations of decision tree analysis

DTA offers a transparent approach to making decisions. It is open to constructive criticism and thus to improvement as new information becomes available. However, DTA is meant to guide treatment, not dictate it. To decide merely on the result of such an analysis is to lose track of the forest (ie, the clinical context) in which this single decision tree is located. DTA attempts to simplify decision making by trying to examine the most common relevant factors and outcomes. It is based on estimates that are prone to error and bias that arithmetically propagate through
the tree’s branches with the risk of resulting in a meaningless EUV. This is because the EUV is the product of a series of probabilities and the utility along the tree’s branch. The errors associated at each probability and the utility are also compounded, possibly resulting in a standard error so large that confidence is lost in the precision of the estimated EUV. Indeed, as generally practiced, the propagation of errors is not calculated, and confidence intervals are not produced. This lack of precision is especially true when the trees become large and more complex and such calculations would be time consuming. Although sensitivity analysis attempts to address this issue, it can only be done by varying one (one-way sensitivity analysis), two (two-way sensitivity analysis), or maybe even three (three-way sensitivity analysis) factors at a time. Varying more than three factors is complex and difficult to interpret.23,27,36 DTA analysis works well when the outcomes are mutually exclusive and specific over a limited time horizon. For example, Ruth’s decision tree assumes that every option is an all-or-nothing outcome where if Ruth chooses the implant and it fails, then Ruth is left with no choice but to live with no tooth until the 5-year time span is reached. But in reality, Ruth may be in one of many different health states over the 5-year time

span. For example, if at the start of the time span Ruth decides on the implant and it fails to integrate after 6 months, she still has the option of considering another tooth replacement to live with over the next 4½ years. If at this point she chooses the denture and it fails in a year, then she may elect to try the implant again or even live with an FPD for the next 3½ years. In other words, Ruth may be in any one of five health states (FPD, implant, cast RPD, acrylic denture, or missing front tooth) at any point in time over the 5-year span. Constructing a decision tree that considers all possible transitional health states is complex, especially if the time span is not 5 years but Ruth's entire lifetime. Furthermore, in such long-term cases, the probability and utility of options may change; for example, Ruth's utility for an implant at age 40 is likely to be different than at age 90. The clinical scenario in which a patient may go through multiple cycles of decision making, and hence different health states over time, can be analyzed using Markov models or Monte Carlo simulation. These sophisticated statistical approaches have been employed in complex chronic health conditions often seen in dentistry, such as periodontal disease, and many such analyses of dental conditions have been published.77–81 However, these
complex statistical analyses, discussed in detail elsewhere,24,26,27 are beyond the scope of this book and are in any case not widely employed in dental offices at present.

Decision tree and economic analyses are often used at the policy-making level, where insurance agencies or governments must decide which health care programs are worth funding for a large population. Although it is understood that outcomes are not absolute, it is hoped that the policy will be advantageous for most people over the long run (ie, the greatest good for the greatest number), in keeping with the utilitarian philosophy of Jeremy Bentham. Such decisions are typically made by politicians and bureaucrats who must balance health needs with other competing interests. As resources are often limited, pragmatic but not optimal decisions may be made. For example, a social welfare program may provide only a removable denture because that choice represents the most efficient way of spending limited public economic resources to improve oral health and function.

At the individual level, however, the outcome is all-or-nothing. Ruth is making a one-time decision among the treatment options, and her choice will either work or not. Hence, the decision involves not only the probability of gaining a benefit but also the individual psychology of how Ruth will feel if the decision works out (ie, winning) or not (ie, losing). The growing science of behavioral economics pioneered by Tversky and Kahneman (prospect theory; see also chapter 22) shows that individuals are generally risk averse; they favor avoiding a loss more than gaining an equivalent win.82–84 This attitude runs counter to the axioms of expected utility theory, which assume that people are risk neutral or rational risk takers who choose the behavior that results in maximum personal welfare.15

Supplementation of Decision Tree Analysis with Other Approaches to Decision Making

Despite its limitations, DTA can help dentists and their patients make rational choices when confronted with health care decisions. In this age of patient-centered care, the focus on evidence-based decision making enables patients to be actively involved in an effective manner, choosing options that reflect their unique situations and perceptions. Patients' limited technical knowledge necessitates the

incorporation of professional advice in the process. The DTA approach is based on the three guiding principles of evidence-based dentistry: (1) the best science, (2) clinical expertise, and (3) patient preferences and values. Evidence-based decision making exemplifies good practice in critical thinking, as it involves careful examination of the alternatives and consequences of decisions. A decision tree is designed to give a simplified representation of what is often a complex clinical scenario. Hence, it can help patients understand the context of uncertainty and the trade-offs made among competing treatment options. The probabilities are adjusted based on clinical experience, the published scientific literature, and the patient's context. In particular, the dentist needs to consider how well the patient in the dental chair is represented by the populations used in the studies underlying the probabilities used in the analysis. Patient preference for each outcome is assessed quantitatively through patient-specific utility measurements. The robustness of the decision is tested through sensitivity analyses, which allow patients to reflect on "what if" scenarios. Guiding patients through a decision tree that closely resembles their clinical context allows them to understand uncertainty and the steps they can take to mitigate it.

However, in some circumstances, supplemental qualitative modes of decision making need to be incorporated into the process. For example, an individual patient may require detailed consideration of how the proposed treatment may affect other teeth, general function, or occlusion. This calls to mind our example of an experienced periodontist using the "moral algebra" approach of Ben Franklin, considering every tooth in the patient's head prior to making a decision on tooth extraction. Other considerations for which quantitative data might not be available include the capability of the patient to undertake the oral hygiene procedures or dietary recommendations that may be essential to the success of the proposed treatment.

Another issue is that of time. Evidence-based decision making that includes assessment of patient preferences and ability to pay may take some time to accomplish, both because of the time needed to administer the methods and because of the education required to enable the patient to participate in the methodologies involved. With practice, the dental professional may become more time-efficient with the methods, and ancillary personnel can possibly be used effectively to assess such parameters as willingness to pay. Over the long term, such procedures may be cost-effective for the dentist, as
they seem likely to increase patient satisfaction and retention in the practice.

Another decision-making approach that can be relevant in some decisions is the "one big reason" method employed by Darwin in his decision to marry. In emergency situations, for example, the need for a timely decision may be so urgent that there is no time for detailed consideration or consultation. Thus, in emergency rooms dealing with trauma patients, a general surgeon may be called upon to stabilize the position of the mandible but may not have the time to consult with dental personnel or the experience to optimize the occlusion. Nevertheless, the "one big reason" approach to making the decision to undertake the procedure may be appropriate in the context of the need for immediate action.

References

1. Osler W, Silverman ME, Murray TJ, Bryan CS, American College of Physicians—American Society of Internal Medicine. The Quotable Osler. Philadelphia: American College of Physicians, 2008.
2. Groopman JE. How Doctors Think. Boston: Houghton Mifflin, 2008.
3. Gigerenzer G, Todd PM, ABC Research Group. Simple Heuristics That Make Us Smart. Oxford: Oxford University, 1999.
4. Raab M, Gigerenzer G. The power of simplicity and fast-and-frugal heuristics approach to performance science. Front Psychol 2015;6:1672.
5. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Boca Raton: Chapman & Hall, 1993.
6. Gigerenzer G, Todd PM, ABC Research Group. Simple Heuristics That Make Us Smart. Oxford: Oxford University, 1999:7–15.
7. Tort P. Charles Darwin: The Scholar Who Changed Human History. London: Thames & Hudson, 2001.
8. Costa JT. The impish side of evolution's icon. American Scientist 2018;106:104–111.
9. Prasad V, Ioannidis JP. Evidence-based de-implementation for contradicted, unproven, and aspiring healthcare practices. Implement Sci 2014;9:1.
10. Gillam S, Siriwardena AN. Evidence-based healthcare and quality improvement. Qual Prim Care 2014;22:125–132.
11. Fearing G, Barwick M, Kimber M. Clinical transformation: Manager's perspectives on implementation of evidence-based practice. Adm Policy Ment Health 2014;41:455–468.
12. Walshe K, Rundall TG. Evidence-based management: From theory to practice in health care. Milbank Q 2001;79:429–457.
13. ADA: Center for Evidence-Based Dentistry. About EBD. https://ebd.ada.org/en/about. Accessed 19 May 2018.
14. Institute of Medicine. Crossing the Quality Chasm: A New Health System for the 21st Century. Washington, DC: National Academy, 2001.
15. von Neumann J, Morgenstern O. Theory of Games and Economic Behavior (60th Anniversary Commemorative Edition). Princeton: Princeton University, 2004.

16. Rubel RA. Decision analysis: A mathematical aid for clinical decisions. Med Times 1968;96:1033–1040.
17. McNeil BJ, Keller E, Adelstein SJ. Primer on certain elements of medical decision making. N Engl J Med 1975;293:211–215.
18. Fineberg HV. Medical decision making and the future of medical practice. Med Decis Making 1981;1:4–6.
19. Pauker SG, Kassirer JP. Decision analysis. N Engl J Med 1987;316:250–258.
20. Detsky AS, Naglie G, Krahn MD, Naimark D, Redelmeier DA. Primer on medical decision analysis: Part 1—Getting started. Med Decis Making 1997;17:123–125.
21. Detsky AS, Naglie G, Krahn MD, Redelmeier DA, Naimark D. Primer on medical decision analysis: Part 2—Building a tree. Med Decis Making 1997;17:126–135.
22. Naglie G, Krahn MD, Naimark D, Redelmeier DA, Detsky AS. Primer on medical decision analysis: Part 3—Estimating probabilities and utilities. Med Decis Making 1997;17:136–141.
23. Krahn MD, Naglie G, Naimark D, Redelmeier DA, Detsky AS. Primer on medical decision analysis: Part 4—Analyzing the model and interpreting the results. Med Decis Making 1997;17:142–151.
24. Naimark D, Krahn MD, Naglie G, Redelmeier DA, Detsky AS. Primer on medical decision analysis: Part 5—Working with Markov processes. Med Decis Making 1997;17:152–159.
25. Philips Z, Ginnelly L, Sculpher M, et al. Review of guidelines for good practice in decision-analytic modelling in health technology assessment. Health Technol Assess 2004;8:iii–iv,ix–xi,1–158.
26. Sox HC, Higgins MC, Owens DK. Medical Decision Making, ed 2. Chichester: Wiley-Blackwell, 2013.
27. Hunink MGM, Weinstein MC, Wittenberg E, et al. Decision Making in Health and Medicine: Integrating Evidence and Values, ed 2. Cambridge: Cambridge University, 2014.
28. Abrahamyan L, Pechlivanoglou P, Krahn M, et al. A practical approach to evidence-based dentistry: IX: How to appraise and use an article about economic analysis. J Am Dent Assoc 2015;146:679–689.
29. Aquilino SA, Shugars DA, Bader JD, White BA. Ten-year survival rates of teeth adjacent to treated and untreated posterior bounded edentulous spaces. J Prosthet Dent 2001;85:455–460.
30. Setzer FC, Kim S. Comparison of long-term survival of implants and endodontically treated teeth. J Dent Res 2014;93:19–26.
31. Chrcanovic BR, Albrektsson T, Wennerberg A. Smoking and dental implants: A systematic review and meta-analysis. J Dent 2015;43:487–498.
32. Moraschini V, Barboza E. Success of dental implants in smokers and non-smokers: A systematic review and meta-analysis. Int J Oral Maxillofac Surg 2016;45:205–215.
33. Whitworth JM, Walls AW, Wassell RW. Crowns and extra-coronal restorations: Endodontic considerations: The pulp, the root-treated tooth and the crown. Br Dent J 2002;192:315–320, 323–327.
34. Rehmann P, Orbach K, Ferger P, Wöstmann B. Treatment outcomes with removable partial dentures: A retrospective analysis. Int J Prosthodont 2013;26:147–150.
35. Vermeulen AH, Keltjens HM, van’t Hof MA, Kayser AF. Ten-year evaluation of removable partial dentures: Survival rates based on retreatment, not wearing and replacement. J Prosthet Dent 1996;76:267–272.
36. Oxford Living Dictionaries. Outcome. https://en.oxforddictionaries.com/definition/outcome. Accessed 19 May 2018.
37. Bader JD, Ismail AI. A primer on outcomes in dentistry. J Public Health Dent 1999;59:131–135.


38. Mühlbacher AC, Juhnke C. Patient preferences versus physicians’ judgement: Does it make a difference in healthcare decision making? Appl Health Econ Health Policy 2013;11:163–180.
39. Scambler S, Delgado M, Asimakopoulou K. Defining patient-centred care in dentistry? A systematic review of the dental literature. Br Dent J 2016;221:477–484.
40. Scambler S, Gupta A, Asimakopoulou K. Patient-centred care—What is it and how is it practised in the dental surgery? Health Expect 2015;18:2549–2558.
41. Mills I, Frost J, Cooper C, Moles DR, Kay E. Patient-centred care in general dental practice—A systematic review of the literature. BMC Oral Health 2014;14:64.
42. Patient-Centered Outcomes Research Institute website. https://www.pcori.org. Accessed 18 May 2018.
43. Brazier J, Deverill M, Green C, Harper R, Booth A. A review of the use of health status measures in economic evaluation. Health Technol Assess 1999;3:i–iv.
44. Birch S, Ismail AI. Patient preferences and the measurement of utilities in the evaluation of dental technologies. J Dent Res 2002;81:446–450.
45. Drummond MF, Claxton K, Sculpher MJ, et al. Methods for the Economic Evaluation of Health Care Programmes, ed 4. Oxford: Oxford University, 2015.
46. Nassani MZ, Devlin H, McCord JF, Kay EJ. The shortened dental arch—An assessment of patients’ dental health state utility values. Int Dent J 2005;55:307–312.
47. Fox D, Kay EJ, O’Brien K. A new method of measuring how much anterior tooth alignment means to adolescents. Eur J Orthod 2000;22:299–305.
48. Cohen ME, Arthur JS, Rodden JW. Patients’ retrospective preference for extraction of asymptomatic third molars. Community Dent Oral Epidemiol 1990;18:260–263.
49. Jacobson JJ, Maxson BB, Mays K, Kowalski CJ. A utility analysis of dental implants. Int J Oral Maxillofac Implants 1992;7:381–388.
50. Zitzmann NU, Marinello CP. Treatment outcomes of fixed or removable implant-supported prostheses in the edentulous maxilla. Part I: Patients’ assessments. J Prosthet Dent 2000;83:424–433.
51. Nassani MZ, Devlin H, McCord JF, Kay EJ. The shortened dental arch—An assessment of patients’ dental health state utility values. Int Dent J 2005;55:307–312.
52. Nassani MZ, Locker D, Elmesallati AA, et al. Dental health state utility values associated with tooth loss in two contrasting cultures. J Oral Rehabil 2009;36:601–609.
53. Ikebe K, Hazeyama T, Kagawa R, Matsuda K, Maeda Y. Subjective values of different treatments for missing molars in older Japanese. J Oral Rehabil 2010;37:892–899.
54. Kay EJ, Nassani MZ, Aswad M, Abdelkader RS, Tarakji B. The disutility of tooth loss: a comparison of patient and professional values. J Public Health Dent 2014;74:89–92.
55. Nassani MZ, Kay EJ, Al-Nahhal TI, Okşayan R, Usumez A, Mohammadi TM. Is the value of oral health related to culture and environment, or function and aesthetics? Community Dent Health 2015;32:204–208.
56. Datarkar A, Daware S, Dande R. Utility of vacuum pressed silicon sheet as a bite raising appliance in the management of TMJ dysfunction syndrome. J Maxillofac Oral Surg 2017;16:342–346.
57. Torrance GW, Feeny DH, Furlong WJ, Barr RD, Zhang Y, Wang Q. Multiattribute utility function for a comprehensive health status classification system. Health Utilities Index Mark 2. Med Care 1996;34:702–722.

58. Tolley K. What are health utilities? http://www.bandolier.org.uk/painres/download/What%20is%202009/what_are_health_util/pdf. Accessed 21 May 2018.
59. Brazier J, Deverill M, Green C, Harper R, Booth A. A review of the use of health status measures in economic evaluation. Health Technol Assess 1999;3:1–164.
60. Patrick DL, Starks HE, Cain KC, Uhlmann RF, Pearlman RA. Measuring preferences for health states worse than death. Med Decis Making 1994;14:9–18.
61. Richardson J, Hawthorn D. Negative Utility Scores and Evaluating the AQoL All Worst Health State: Working Paper 113, Centre for Health Program Evaluation, Australia, 2001:4–12. https://www.aqol.com.au/papers/workingpaper113.pdf. Accessed 27 October 2018.
62. Franic DM, Pathak DS. Effect of including (versus excluding) fates worse than death on utility measurement. Int J Technol Assess Health Care 2003;19:347–361.
63. Berglundh T, Persson L, Klinge B. A systematic review of the incidence of biological and technical complications in implant dentistry reported in prospective longitudinal studies of at least 5 years. J Clin Periodontol 2002;29:197–212.
64. Abrahamyan L, Pechlivanoglou P, Krahn M, et al. A practical approach to evidence-based dentistry: IX: How to appraise and use an article about economic analysis. J Am Dent Assoc 2015;146:679–689.
65. Smith RD. The discrete-choice willingness-to-pay question format in health economics: Should we adopt environmental guidelines? Med Decis Making 2000;20:194–206.
66. Frew EJ, Whynes DK, Wolstenholme JL. Eliciting willingness to pay: Comparing closed-ended with open-ended and payment scale formats. Med Decis Making 2003;23:150–159.
67. Currie GR, Donaldson C, O’Brien BJ, Stoddart GL, Torrance GW, Drummond MF. Willingness to pay for what? A note on alternative definitions of health care program benefits for contingent valuation studies. Med Decis Making 2002;22:493–497.
68. Bala MV, Mauskopf JA, Wood LL. Willingness to pay as a measure of health benefits. Pharmacoeconomics 1999;15:9–18.
69. Tan SHX, Vernazza CR, Nair R. Critical review of willingness to pay for clinical oral health interventions. J Dent 2017;64:1–12.
70. Matthews DC, Birch S, Gafni A, DiCenso A. Willingness to pay for periodontal therapy: Development and testing of an instrument. J Public Health Dent 1999;59:44–51.
71. Matthews D, Rocchi A, Gafni A. Putting your money where your mouth is: Willingness to pay for dental gel. Pharmacoeconomics 2002;20:245–255.
72. Matthews D, Rocchi A, Wang EC, Gafni A. Use of an interactive tool to assess patients’ willingness-to-pay. J Biomed Inform 2001;34:311–320.
73. Tamaki Y, Nomura Y, Teraoka K, et al. Characteristics and willingness of patients to pay for regular dental check-ups in Japan. J Oral Sci 2004;46:127–133.
74. Balevi B, Shepperd S. The management of an endodontically abscessed tooth: Patient health state utility, decision-tree and economic analysis. BMC Oral Health 2007;7:17.
75. Leung KC, McGrath CP. Willingness to pay for implant therapy: A study of patient preference. Clin Oral Implants Res 2010;21:789–793.
76. Augusti D, Augusti G, Re D. Prosthetic restoration in the single-tooth gap: Patient preferences and analysis of the WTP index. Clin Oral Implants Res 2014;25:1257–1264.


77. Mahl D, Marinello CP, Sendi P. Markov models in dentistry: Application to resin-bonded bridges and review of the literature. Expert Rev Pharmacoecon Outcomes Res 2012;12:623–629.
78. Chi DL, van der Goes DN, Ney JP. Cost-effectiveness of pit-and-fissure sealants on primary molars in Medicaid-enrolled children. Am J Public Health 2014;104:555–561.
79. Schwendicke F, Stolpe M. Secondary treatment for asymptomatic root canal treated teeth: A cost-effectiveness analysis. J Endod 2015;41:812–816.
80. Skaar DD, Park T, Swiontkowski MF, Kuntz KM. Cost-effectiveness of antibiotic prophylaxis for dental patients with prosthetic joints: Comparisons of antibiotic regimens for patients with total hip arthroplasty. J Am Dent Assoc 2015;146:830–839.

81. Chun JS, Har A, Lim HP, Lim HJ. The analysis of cost-effectiveness of implant and conventional fixed dental prosthesis. J Adv Prosthodont 2016;8:53–61.
82. Kahneman D, Tversky A. Prospect theory: An analysis of decision under risk. Econometrica 1979;47:263–291.
83. Chambers DW. Behavioral economics. J Am Coll Dent 2009;76:55–62.
84. Scarbecz M. “Nudging” your patients toward improved oral health. J Am Dent Assoc 2012;143:907–915.


Chapter 24: Exercises in Critical Thinking



“It is difficult to overstate the value of practice. For a new skill to become automatic or for new knowledge to become long-lasting, sustained practice, beyond the point of mastery, is necessary.”
DANIEL T. WILLINGHAM

Willingham DT. Practice makes perfect—But only if you practice beyond the point of perfection. Am Educator 2004;Spring:31.

Problems

The following excerpts from the dental literature are intended to illustrate some of the concepts introduced in the preceding chapters. To keep the section reasonably brief, the examples have been extracted from papers, and much detail has been omitted. In some instances, the authors discuss the weaknesses or strengths of the particular approach they employed in their article; that discussion is not included here. The intent in presenting these examples is not to criticize or commend the articles in question, but rather to show how the arguments, strategies, and ideas discussed in this book appear in the dental (or popular) literature. To provide a wide range of examples in a reasonable amount of space, only select material was extracted from the papers. Often the problems contain a conclusion drawn from the abstract or summary of the paper and select material from other sections of the paper, such as Materials and Methods and Results, relevant to the conclusion. In approaching these problems, you should assume that the aspects of the conclusions that are not concerned with the information presented in the Materials and Methods and Results extracts are not problematic. For example, in Problem 1 assume that bone gain was actually achieved, even though it is not clear from the material that is presented how bone gain was measured, the time fluoride was applied, and so forth.

Problem 1

Biller T, Yosepovitch Z, Gedalia I. Effects of topical fluoride in the healing rate of experimental calvarial defects in rats. J Dent Res 1977;56:53–56.

Summary: “Bone gain was achieved after topical application of fluoride. Fluoride has a strong promoting effect on osteogenesis and accelerates the repair process of defects in membranous bone. No major histological differences are evident in the newly formed bone.”

Materials and Methods: “In one half of the rats that underwent the procedure, a cotton wool swab soaked in 2% acidulated (0.1 M H3PO4) NaF solution was placed in the defect for 20 minutes. The area was then irrigated again with physiological saline solution. The scalp was then sutured. The remaining rats underwent the same local treatment with saline solution and served as controls.”

Is this a positive or negative results paper? Is there any explanation, other than the effect of fluoride ions, that could explain the results?

Problem 2

Little JW, Wilson JC, Bickley HC, Bickley C. Effects of parathyroid extract on the rupture strength of intact skin of the rat. J Dent Res 1977;56:46–47.

Summary: “Rats were injected with parathyroid extract (PTE) to search for possible effects on connective tissue of the skin. Rupture-strength analysis of skin samples showed a significant increase in strength of skin from PTE-treated rats. The explanation for this effect is not understood at present.”

From the Results: “PTE-treated rats lost weight during the 3-day experimental period. The mean starting weight in grams for the PTE-treated groups was 149.75; for the vehicle-treated group, the mean was 148.20; and for the saline solution–treated group, the mean was 153.80. There were no significant differences among groups. At the time of death, the mean weight in grams for the PTE-treated group was 127.63; for the vehicle-treated group it was 162.50; and for the saline solution–treated group, the mean was 163.20. The mean weight at death of the PTE-treated group was significantly different from that of the two control groups . . . (ie, the vehicle-treated group and saline solution–treated group).”

Are there any problems with the interpretation of the increase in rupture strength as a specific effect of PTE?

Problem 3

Nacht M. A devitalizing technique for pulpotomy in primary molars. J Dent Children 1956;22:45–47.

“A review of literature revealed some interesting work done by several men with mummifying pastes. In 1929 Dr Hess, of the University of Zurich, reported negative results in a bacterial analysis of 62 pulps mummified with formaldehyde paste. Dr H. R. Foster of Oakland in 1936 describes a successful treatment using formocresol and paste. Again in 1939 Dr K. A. Easlick used the same treatment substituting paraformaldehyde. In the Handbook of Dental Practice—1948, Dr Charles Sweet describes a treatment using zinc oxide, cresolated formaldehyde and eugenol. Since the principle of mummification had been used by so many of these eminent men at various times, it was decided that this might be the answer to our problem.” Identify the main form of argument used in this paragraph.

Problem 4

Brekke JH, Bresner M, Reitman MJ. Effect of surgical trauma and polylactate cubes and granules on the incidence of alveolar osteitis in mandibular third molar extraction wounds. J Can Dent Assoc 1986;4:315–319.

“The polylactic acid surgical dressing material, in either cube or granular form, substantially reduces the incidence of alveolar osteitis in healthy patients if all other principles of careful surgical technique are observed.”

Identify the technique used in this sentence that could deflect possible criticism of the major findings on the dressing material.

Problem 5

Cipes MH, Miraglia M, Gaulin-Kremer E. Habits, monitoring and reinforcement to eliminate thumbsucking. ASDC J Dent Child 1986;53:48–52.

“Although contingency contracting is a widely used strategy for involving parents in modifying their children’s behavior, this approach has apparently not been applied to the elimination of thumbsucking. . . . This paper explores monitoring and contingency contracting as alternative treatments for the persistent thumbsucker.”

Identify the logical technique used in constructing this approach and the investigational strategy involved in adopting this approach to the problem.


Problem 6

Hoad-Reddick G. Gagging: A chairside approach to control. Br Dent J 1986;161:174–176.

“An attempt was made to help people who were unable to wear dentures owing to an exaggerated gagging reflex. Nineteen patients (7 women and 12 men) were taught a controlled method of breathing based on that recommended by the National Childbirth Trust for use by women in labor. The technique is described and this approach to control is related to work done elsewhere. Fourteen of the patients now wear dentures full time.

“In this study, patients who . . . were unable to wear dentures at all owing to retching problems were encouraged to make one further attempt at denture-wearing, using a breathing technique based on that recommended by the National Childbirth Trust for use by women in labor.

“Landa suggests that the majority of patients show a history of a precipitating cause. In an examination of personalities of dental patients who retched while attempting to wear dentures, Wright used Eysenck Personality Questionnaires. There was no evidence to suggest that retching patients were more neurotic than the control group. Most workers agree that retching is multifactorial in origin.

“All patients were instructed in controlled rhythmic breathing and told to practice it for one or two weeks before prosthetic treatment commenced. . . .”

Identify the logical technique used in constructing this approach and the investigational strategy involved in adopting this approach to the problem. Are there any negative results reported here?

Problem 7

Robinson PJ, Shapiro IM. Effect of diphosphonates on root resorption. J Dent Res 1966;55:166.

“These results indicate that under the conditions of this in vivo model system, diphosphonate does not retard the rate of root resorption. In addition, 1.0% pyrophosphate or 2% sodium fluoride are no more effective than physiological saline in inhibiting root resorption.”

Are these positive or negative results? What questions would you want answered when you read the paper?

Problem 8

Grenby TH, Desai T. A trial of lactitol in sweets and its effects on human dental plaque. Br Dent J 1988;164:383–387.

Summary: “Thirty subjects aged 18–20 years ate boiled sweets made with either sucrose or lactitol in addition to their normal diet over a 3-day experimental period and ceased all oral hygiene. Plaque accumulating on the teeth over the 3 days was assessed by three different methods, all of which showed lower values on the lactitol than on the sucrose sweets (P < 0.005 by the photographic method; P = 0.025 by the gravimetric method). Plaque collected from the lactitol-sweets group contained less soluble carbohydrate, glucose and sucrose, but was relatively higher in protein, calcium and phosphorous, than that from the sucrose-sweets group. There were unfavorable reactions to the texture and gastric effects of the lactitol sweets. . . .”

Materials and Methods: “Sweets: These were the popular mint-humbug type, black-and-white striped boiled sweets, weighing approximately 3 g. The conventional sweets were made from a blend of sucrose and glucose syrup, which after boiling, contained 90% sucrose and 10% glucose. The experimental sweets contained 100% lactitol by weight, with additional sweetening by acesulfam-K. They were supplied in 4 oz (113 g) packs, which were identified by the experimental subjects according to color-coding only, not by composition or sweetening agent.”

Are there any alternative hypotheses to explain these data?

Problem 9

Kawazoe Y, Kotani H, Hamada T, Yamada S. Effect of occlusal splints on the electromyographic activities of masseter muscles during maximum clenching in patients with myofacial-pain-dysfunction syndrome. J Prosthet Dent 1980;43:578–580.

“If the elimination of occlusal interferences causes a decrease in the degree of tactile afferent impulses from periodontal receptors, the masseter muscle activity during maximum clenching with the splint should be reduced more than without a splint (intracuspal clenching) in patients with MPD syndrome having occlusal interferences.”

Identify the form of the deductive argument used and, if relevant, any additional premises that might be added to make the argument valid.


Problem 10

Eggleston DW. The interrelationship of stress and degenerative diseases. J Prosthet Dent 1980;44:541–544.

“If dental plaque were the only etiologic factor for caries and periodontal disease then all people with dental plaque would have these diseases. Such is not the case. Primitive humans on their natural diet have the lowest incidence of dental caries and periodontal disease even though they have no devices for removal of plaque.”

Identify the form of deductive argument used here and any additional rhetorical technique that contributes to the force of the argument.

Problem 11

Following up on the argument on chalones presented in chapter 5:

Iversen OH. Comments on chalones and cancer. Mech Ageing Dev 1980;12:211–212.

“The correct syllogism is: Chalones cause proliferation decay and thus tumor regression. Tumor regression is most often not followed by cure in human cancer cases. Ergo: Chalones will most often not cure cancer.”

Is this syllogism valid?

Problem 12

Sussman MI. Tooth reimplantation when it follows unintentional evulsion utilizing synthetic bone. Oral Health 1986;76:29.

“Since all 32 teeth were present, it was decided to attempt to save this tooth via periodontal surgery.”

Comment on the logic used in this sentence.

Problem 13

Kontturi-Närhi V, Markkanen S, Markkanen H. Effects of airpolishing on dental plaque removal and hard tissues as evaluated by scanning electron microscopy. J Periodontol 1990;61:334–338.

In this paper, the following scale was described. “The condition of enamel surface was classified into three groups on the basis of photographs.
1. No abrasion: Smooth normal enamel surface.
2. Mild abrasion: Few micropits or prism ends (ameloblastic pits) visible between perikymata lines.
3. Severe abrasion: Distinct perikymata lines with many prism ends and/or micropits visible on the whole surface, occasional fracturing of perikymata edge.”

Discuss this measurement and indicate the type of statistical test that would be used to compare surfaces before and after air polishing.

Problem 14

Henderson CW, Schwartz RS, Herbold ET, Mayhew RB. Evaluation of the barrier system, an infection control system for the dental laboratory. J Prosthet Dent 1987;58:517–521.

“On that particular day a different technician was used and the results did not agree with the rest of the data. . . . If the results on that day were eliminated from the data, after the first cleansing there would be no positive cultures for sodium hypochlorite, and after the second cleansing only one positive culture. This would improve the results of 3.25% sodium hypochlorite.”

Identify the type of error alluded to by the author in this excerpt.

Problem 15

Meinig DA. Removable partial dentures without rests. J Prosthet Dent 1994;71:350–358.

“Very poor—could lose their remaining teeth in two to three years
Poor—could lose their remaining teeth in 3 to 5 years
Fair—could lose several teeth but not all of them
Good—probably will keep all of their teeth for their lifetime.”

Comment on the classification system for periodontal condition used in this study. Is this an operational definition?

Problem 16

Stach DJ, Cross-Poline GN, Newman SM, Tilliss TS. Effect of repeated sterilization and ultrasonic cleaning on curet blades. J Dent Hyg 1995;69:31–39.

“The blades pretreated with the anticorrosive and then autoclaved were the most difficult to evaluate via SEM photographs because the product itself appears to leave a visible residue on the blade surface.”


Identify the type of measurement problem experienced in this study.

Problem 17

Novak MJ, Polson AM, Adair SM. Tetracycline therapy in patients with early juvenile periodontitis. J Periodontol 1988;59:366–372.

“1. The distance from the cementoenamel junction (CEJ) to the alveolar bone crest. The CEJ was designated as that point where the outer edge of the crown intersected the outer edge of the dentin of the root. The alveolar bone crest for a specific tooth surface was defined as the most coronal point of bone adjacent to the tooth surface where the periodontal ligament space had a uniform width. If an oblique flaring of the periodontal ligament space occurred coronally, the alveolar crest was taken as the point immediately subjacent to the flare, where the ligament space still exhibited uniform width.”

Comment on this measurement. Does it meet Wilson’s criteria (see page 153)?

Problem 18

Murray ID, McCabe JF, Storer R. Abrasivity of denture cleaning pastes in vitro and in situ. Br Dent J 1986;16:137–141.

Summary: “The six-month abrasion scores . . . show a similar result to that recorded at one month except that differences between materials have diminished. This is due to the fact that the maximum possible abrasion score is 4 and a number of G and K dentures reached this value well before six months.”

Identify the experimental tactical problem that occurred in this study.

Problem 19

Triol CW, Mandanas BY, Juliano GF, Yraolo B, Cano-Arevalo M, Volpe AR. A clinical study of children comparing anticaries effect of two fluoride dentifrices. A 31-month study. Clin Prev Dent 1987;9:22–24.

Summary: “A negative control dentifrice (nonfluoridated) was not used in this program, since the water supply was below [optimal] levels of fluoridation, and total abstention from fluoride was considered not to be good dental practice.”

Comment on the use of controls in this study.

Problem 20

Cao CF, Aeppli DM, Bloomquist WF, Bandt CL, Wolff LF. Comparison of plaque microflora between Chinese and Caucasian population groups. J Clin Periodontol 1990;17:115–118.

From the Abstract: “This investigation was designed to compare the predominant plaque micro-organisms from a Chinese group of patients exhibiting periodontitis with an age-, sex- and periodontal disease–matched Caucasian group of patients. In addition to race, the 2 population groups differed with respect to diet and oral hygiene habits, or effectiveness at removing plaque. Clinical measurements were determined along with an evaluation for micro-organisms in supragingival and subgingival plaque. Although the Chinese and Caucasian population groups were similar with respect to composition of microorganisms in subgingival plaque, notable differences were observed in supragingival plaque. The microbial differences observed in supragingival plaque may be explained at least in part, if not totally, by the higher plaque index scores of the Chinese versus Caucasian population groups.”

From Materials and Methods: “10 visiting male Chinese students or scholars at the University of Minnesota, aged 25 to 40 years (mean age ± SD = 31.3 ± 4.4 years), with symptoms of gingival bleeding, were evaluated for periodontal disease. The criteria for selection of Chinese patients included less than 2.5 years residence in the USA. . . . Caucasians were selected from previous studies for comparison. . . . There was no statistically significant difference between Chinese and Caucasians in this study with respect to the gingival index.”

From the Results: “The proportions of spirochetes, motile rods and cocci in plaques are shown. . . . The 2 groups differ significantly with respect to all three microbial forms in supragingival plaque.”

From the Discussion: “The greater amount of plaque in the Chinese subjects can probably best be explained by differences in oral hygiene habits. Differences in the microbial composition of supragingival plaque between Chinese and Caucasians is likely attributable to the age and quantity of plaque, since old plaque is inhabited by a more complicated microflora. . . . However, the general similarity of subgingival cultivatable flora in the two population groups suggests that subgingival plaque is more dependent on its microenvironment than on the composition of the adjacent supragingival plaque, race or diet.”

How would you classify this study design? What are some of the difficulties in coming to definitive conclusions with this approach?

Problem 21

Addy M, Carpenter R, Roberts WR. Management of recurrent aphthous ulceration: A trial of chlorhexidine gluconate gel. Br Dent J 1976;141:118–120.

Some relevant information extracted from the paper includes: “Thirty patients agreed to participate in the trial. They were chosen from a larger group of aphthous-ulcer sufferers who regularly attended dental school and who experienced regular and frequent ulceration. The trial was conducted in a double-blind crossover manner employing an active gel containing 1 percent chlorhexidine gluconate in an aqueous base. Each gel was used for a period of 35 days with 14 days between the two preparations to avoid carry-over effects. At the commencement of the trial each patient was examined and then verbally instructed on the use of gels. Thus, each gel was to be used 3 times a day after meals. The patients were requested to place approximately 2.5 cm of the gel on the index finger, carry it to the mouth and allow the gel to distribute itself throughout the mouth, any residue being swallowed. The patients were also instructed to record the number and duration of the ulcers and to describe the discomfort experienced according to an arbitrary scale of: 1 = uncomfortable; 2 = fairly painful; 3 = very painful.”

See the results in Table P-1. The difference in means of 0.27 between the chlorhexidine treatment and the placebo treatment was significant (P < .05).

Table P-1 | Mean severity scores of ulcers

Parameter | Chlorhexidine gel | Placebo gel
A/P sequence | 0.84 | 1.31
P/A sequence (9) | 1.03 | 1.11
All patients (20) | 0.93 | 1.22
Adjusted means | 0.94 | 1.21

Difference in means = 0.27; SE of the difference = 0.115
Used with permission from Addy M, Carpenter R, Roberts WR. Br Dent J 1976;141:118.

From the Discussion: “Chlorhexidine gluconate as a 1 percent gel produced a significant reduction in the duration and discomfort of ulcers in a group of 20 patients when compared with a placebo gel.”

Identify the experimental design and discuss any problems that there might be in the analysis of the data.
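As a quick arithmetic check on Table P-1 (my own calculation, not part of the paper), the reported difference and its standard error imply a t statistic of about 2.35, which with roughly 19 degrees of freedom for 20 patients is indeed just significant at the 5% level.

```python
# Back-of-envelope check of the significance reported for Table P-1,
# assuming a t-type comparison with about 19 df (20 patients).
from scipy.stats import t

diff, se, df = 0.27, 0.115, 19
t_stat = diff / se                   # ~2.35
p_two_tailed = 2 * t.sf(t_stat, df)  # ~.03, consistent with P < .05
print(f"t = {t_stat:.2f}, P = {p_two_tailed:.3f}")
```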

Problem 22

Addy M, Moran J, Davies RM, Beak A, Lewis A. The effect of single morning and evening rinses of chlorhexidine on the development of tooth staining and plaque accumulation. J Periodontol 1982;9:134–140.

From the Materials and Methods: “Verbal and written instructions were given at the commencement of each period. Thus, during the rinsing period subjects refrained from all forms of oral hygiene and excluded from their diet coffee, red wine and port. Each was provided with a supply of a branded tea in bags and requested to consume eight cups per day. An attempt to standardize the teas was made by suggesting that one bag should be placed in a cup of boiling water for 2 min. The resulting infusion was then sweetened to taste and all volunteers agreed to add milk.”

Identify the experiment tactic used in this study on tooth staining.


Fig P-1 Relationship of income to aptitude test scores; the vertical axis shows SAT scores (200–800) and the horizontal axis parents’ income ($10,000–$25,000). (From the Ubyssey, October 7, 1980.)

Problem 23

On October 7, 1980, the Ubyssey (a student newspaper at the University of British Columbia) reported that aptitude tests (such as SAT, CSAT, MSAT) show an economic slant. The most striking test bias, they claimed, was the tendency to rank people by income. The following data were presented (Fig P-1).

Give various interpretations of the data in Fig P-1.

Problem 24

Lindeberg RW. Combined management of mucogingival defects with citric acid root conditioning, lateral pedicle grafts, and free gingival grafts. Compend Contin Educ Dent 1985;6:265–266,268,270–272.

From the Abstract: “This article describes a new technique for managing mucogingival defects. The procedure uses a lateral pedicle graft along with citric-acid root conditioning to gain better connective-tissue attachment to denuded root surfaces; it also uses a free gingiva at the donor tissue site. The technique has been used successfully in five patients, demonstrating good esthetics and root coverage with sound clinical attachment, stable for an extended period of time.”

From Materials and Methods: “Five patients, two men and three women, ranging in age from 26–45 with a mean of 36, were included in this study. Nine teeth with mucogingival defects demonstrating localized gingival recession were treated. The case described below involved a mandibular central incisor. . . . ” (Twenty-two clinical photos were presented.)

Summary: “This clinical study demonstrates an effective technique to eliminate mucogingival defects by a combination approach of lateral pedicle graft, free gingival graft, and citric-acid root conditioning. It has been shown that lateral pedicle grafts should not be performed without first conditioning the root with citric acid to guarantee good clinical attachment to the exposed root.”

Comment on the relationship between the conclusions as presented in the summary and the evidence presented in this paper.

Problem 25

Havenaar R. The anti-cariogenic potential of xylitol in comparison with sodium fluoride in rat caries experiments. J Dent Res 1984;62:120–123.

“On the anti-cariogenic potential of xylitol in rat caries experiments: From day zero, the experimental diet and drinking water were supplied ad libitum. On day two the animals were inoculated once with a suspension of Streptococcus mutans strain 50 B4, serotype d, containing at least 108 CFU/mL.”

Comment on the experiment tactics.

Problem 26

Douglas WH, Fields RP, Fundingsland J. A comparison between the microleakage of direct and indirect composite restorative systems. J Dent 1989;17:184–188.

In a study on microleakage, the authors varied the type of resin and whether the resin was applied directly or indirectly (ie, the composite is cured outside the mouth and cemented into the prepared cavity using a dual-cure resin-based cement). Their data provides microleakage in micrometers (standard deviation) (Table P-2). “Statistical analysis was done by a one-way analysis of variance and comparison between the means using the Bonferroni test (see Table [P-3]).”

Identify the experiment design and suggest another approach to the statistical analysis of these data.
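One way to see why an alternative analysis suggests itself: the four groups in Table P-2 form a 2 × 2 factorial of resin by placement method, so the cell means can be decomposed into main effects and an interaction, which a two-way ANOVA would test directly. The calculation below is my own illustration using the published cell means, not the authors’ analysis.

```python
# Decomposing the four cell means of Table P-2 (microleakage, um) as a
# 2 x 2 factorial of resin by placement method. Illustrative only.
means = {("resin1", "direct"): 772, ("resin2", "direct"): 338,
         ("resin1", "indirect"): 171, ("resin2", "indirect"): 27}

method = (means["resin1", "direct"] + means["resin2", "direct"]) / 2 \
       - (means["resin1", "indirect"] + means["resin2", "indirect"]) / 2
resin = (means["resin1", "direct"] + means["resin1", "indirect"]) / 2 \
      - (means["resin2", "direct"] + means["resin2", "indirect"]) / 2
interaction = (means["resin1", "direct"] - means["resin2", "direct"]) \
            - (means["resin1", "indirect"] - means["resin2", "indirect"])

print(method, resin, interaction)  # 456.0, 289.0, 290
```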


Table P-2 | Microleakage of various resins

Group | Resin | Method | Microleakage, µm (SD)
1 | Resin 1 | Direct | 772 (257)
2 | Resin 2 | Direct | 338 (105)
3 | Resin 1 | Indirect | 171 (117)
4 | Resin 2 | Indirect | 27 (22)

Used with permission from Douglas WH, Fields RP, Fundingsland J. J Dent 1989;17:184.

Problem 27

Jensen ME, Kohout F. The effect of a fluoridated dentifrice on root and coronal caries in an older adult population. J Am Dent Assoc 1988;117:829–832.

From Methods and Materials: “A total of 810 volunteers were selected who were 54 years of age and older, had at least ten natural teeth, and were living in nonfluoridated communities. Individuals currently receiving fluoride therapy, antibiotics, or those with severe periodontal disease were excluded from the study. . . .

“After baseline examinations, the subjects were separated by gender, intervals of age (< 64, > 64), and arrayed by intervals of clinical root caries (0, > 1). Within strata, subjects were assigned to treatment groups by random permutations of 2.”

Comment on the procedure used to allocate the subjects into groups. What are the advantages and costs of using this approach?
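The phrase “random permutations of 2” describes blocked (permuted-block) randomization: within each stratum, subjects are assigned in randomly shuffled blocks of two, which keeps the treatment groups balanced as recruitment proceeds. The sketch below is my own illustration of the mechanics, not the investigators’ code.

```python
# Illustration of permuted-block randomization with blocks of 2.
import random

def permuted_blocks(n_subjects, treatments=("test", "control")):
    schedule = []
    while len(schedule) < n_subjects:
        block = list(treatments)
        random.shuffle(block)   # a random permutation of the 2 treatments
        schedule.extend(block)  # after every block the groups are equal
    return schedule[:n_subjects]

print(permuted_blocks(6))  # eg ['control', 'test', 'test', 'control', ...]
```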

Table P-3 | Obtained differences between means and significance levels

Groups | 1 | 2 | 3 | 4
1 |  | P < 0.001 | P < 0.001 | P < 0.001
2 | 434.5 |  | P < 0.01 | P < 0.001
3 | 595.6 | 161.1 |  | P < 0.01
4 | 744.6 | 310.1 | 149.0 | 

(Differences between means appear below the diagonal; significance levels appear above it.)
Used with permission from Douglas WH, Fields RP, Fundingsland J. J Dent 1989;17:184.

Problem 28

Sheiham A, Smales FC, Cushing AM, Cowell CR. Changes in periodontal health in a cohort of British workers over a 14-year period. Br Dent J 1986;60:125–127.

“Of the 659 people examined in 1966, 120 were contacted and 89 accepted the invitation to be examined.”

Identify the possible sources of invalidity in this study.

Problem 29

Former Vancouver Canucks’ head coach Rick Ley once coached the Hartford Whalers. It was reported on February 3, 1990, in the Globe and Mail that he commented as follows on motivating the Whalers:

“Every time I throw them bouquets, it comes back to haunt me. But when I praise them, they don’t play well the next game. Evidently this team would rather be kicked than stroked.”

What statistical phenomenon would appear to be at work that would explain Coach Ley’s observations?

Problem 30

Oguntebi BR, DeSchepper EJ, Taylor TS, White CL, Pink FE. Postoperative pain incidence related to the type of emergency treatment of symptomatic pulpitis. Oral Surg Oral Med Oral Pathol 1992;73:479–483.

From the data in Table P-4, how could you determine if pain was associated with type of treatment? Assuming that you obtained a statistically significant result, what would be a difficulty in interpreting it?

Problem 31

Review data taken from a study on the effects of a mouthrinse on dental caries in children (Table P-5). Why do you think the investigators used 87 per group?

Problem 32

A study examined the fluoride-releasing pattern of five glass ionomers (identified here as A, B, C, D, and E), each type being cured for 20, 40, and 60 seconds and then measured for the fluoride released after 24, 48, 72 hours as well as 7 days. The authors state, “The statistical analyses were performed only for the materials light-cured for 40 seconds, in view of the reports that the best polymerization and physical characteristics are achieved after light curing for 40 seconds.” The authors calculated t scores for the 10 possible comparisons (ie, A-B, A-C, A-D, A-E, B-C, B-D, B-E, C-D, C-E, D-E) at two times: 24 hours and 1 week. The t scores were then used to determine whether there were statistically significant differences between the amounts of fluoride released by the different ionomers.

Comment on the statistical analysis.
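A quick calculation (my own, not from the study) shows why 10 unadjusted t tests invite trouble: the chance of at least one false positive balloons, which is why an adjustment such as Bonferroni, or an ANOVA with post hoc tests, is usually preferred.

```python
# Family-wise error rate for 10 independent comparisons at alpha = .05.
alpha, k = 0.05, 10
print(1 - (1 - alpha) ** k)   # ~0.40: a 40% chance of a spurious "finding"
print(alpha / k)              # 0.005: the Bonferroni-adjusted per-test alpha
```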


Table P-4 | Postoperative pain incidence for emergency treatment of symptomatic pulpitis*

 | Pulpotomy | Partial pulpotomy | Complete pulpotomy
Pain | 30 | 44 | 14
No pain | 364 | 302 | 202
Total | 394 | 346 | 216

*Total no. of patients is 956. Data from Oguntebi BR, DeSchepper EJ, Taylor TS, White CL, Pink FE. Oral Surg Oral Med Oral Pathol 1992;73:479–483.

Table P-5 | Effects of a mouthrinse on dental caries in children

Treatment | DMFT | SD | n
Mouthrinse + placebo | 4.78 | 4.66 | 87
Mouthrinse + fluoride | 5.06 | 5.41 | 87

DMFT = decayed, missing, or filled teeth.

Table P-6 | Results of calcium hydroxide pulp capping done on Group 1 and Group 2 patients

 | Anesthetic used | Teeth treated (no.) | Unsuccessful procedures (no.) | Rate of unsuccessful procedures (%)
Group 1 | Local anesthetic | 9 | 2 | 27.2
Group 2 | General anesthetic (no local anesthetic) | 8 | 1 | 12.5

Data from Gallien GS Jr, Schuman NJ. J Am Dent Assoc 1985;111:599–601.

Problem 33

Evans DJ, Rugg-Gunn AJ, Tabari ED, Butler T. The effect of 25 years of water fluoridation in Newcastle assessed in four surveys of 5-year-old children over an 18-year period. Br Dent J 1995;178:60–64.

a) From the Materials and Methods: “The criteria for the clinical examination for caries used in the previous studies was modified to reflect the criteria used in surveys co-ordinated by the British Association for the Study of Community Dentistry.” From the Results: “Comparison with data from previous studies is tentative due to the change in criteria adopted.”
What kind of threat to validity does the statement “comparison with” exemplify?

b) From the Results: “However data collected from the 110 duplicate examinations by the two examiners allowed a comparison to be made between the results using two different criteria. . . . This difference was not statistically significant.”
Identify the type of result reported in this segment. Does it alleviate the concern over the change of criteria?

c) Judging from the title of the paper, what other possible threats to validity would you suspect?

Problem 34

Gallien GS Jr, Schuman NJ. Local versus general anesthesia: A study of pulpal response in the treatment of cariously exposed teeth. J Am Dent Assoc 1985;111:599–601.

Review Table P-6 from the Results section. What rule of presentation of data is violated here?

Problem 35

Martin JA, Bader JD. Five-year treatment outcomes for teeth with large amalgams and crowns. Oper Dent 1997;22:72–78.

“In 1988 a total of 7,687 target restorations were placed in 5,901 enrolled adult patients. Of these, 4,735 restorations were placed in 3,655 members whose eligibility was continuous from 1988 to 1993. Thus we were able to follow 62% of both the patients and eligible target restorations placed in 1988 to 1993.”

The restorations of those who dropped out of the program were not followed for the entire 5-year period. What is the main threat to the internal validity of this study? How can the authors attempt to determine if the threat seriously compromises their conclusions?


Problem 36

Sherman PR, Hutchens LH Jr, Jewson LG, Moriarty JM, Greco GW, McFall WT Jr. The effectiveness of subgingival scaling and root planing. I. Clinical detection of residual calculus. J Periodontol 1990;61:3–8.

a) From Methods of Measurement: “An unsatisfactory surface was defined as, ‘that surface which in the individual evaluator’s judgment could be made smoother with further instrumentation.’ . . . For use in comparison of the clinical and microscopic detection of calculus, a surface was considered clinically positive for calculus when at least two of the three evaluators determined the surface to be satisfactory.”
Using the criteria on page 153, in what stage in the development of a measurement scale does this fall? What do you think the next step in the development of this measurement should be?

b) “A second appointment, approximately 1 week later, was scheduled to allow the operator an opportunity to evaluate the instrumented surfaces and accomplish any additional scaling and root planing. Although no time constraints were imposed, the total amount of time spent in the use of ultrasonics and the time spent in hand instrumentation per tooth were recorded.”
What principle in experiment design is illustrated here?

Problem 37

Yankell SL, Emling RC. A study of gingival irritation and plaque removal following a three-minute toothbrushing. J Clin Dent 1994;5:1–4.

“If there are ‘no significant differences’ among new designs in efficacy, perhaps safety of use factors should be given increased attention and reporting.”

Comment on the logic of this sentence and how it may be criticized.

Problem 38

The following rhetorical strategies can be found in the following problem (some more than once): authority, data manipulation, scarcity, proposing of alternatives, credibility, agenda control, law of cognitive response, contrast, concreteness, liking, and social validation. Identify the locations where they occur.

Persuading the professor

This cautionary tale demonstrates the various rhetorical strategies routinely employed by salespeople to persuade consumers to part with their hard-earned funds. While the bill of goods in this story is a mutual fund, the same strategies are applied every day in the selling of dental equipment, cars, and other necessities and luxuries of life. More importantly, they are also used in science: Persuasion is a key element of scientific papers, and scientists regularly interact in decision-making situations, such as when serving on grant committees and tenure boards. In each of these arenas, they use their persuasive abilities to convince their colleagues of the rightness of their beliefs.

Paragraph 1
Professor Brunette felt a strong tug of jealousy as he stood in the doorway of his office chatting about financial matters with his colleagues Waterfield and Tonzetich. Both of these men had discovered a path to wealth and had long since left Brunette behind, choking on their dust. Waterfield had reached financial nirvana by the clever stratagem of having his wife work in a company with stock options that had multiplied in value. Tonzetich had arrived at his comfortable position by a lifetime of rigorous financial discipline. As a young married man, for example, he had abstained from purchasing a car so that he could save for a house. Brunette had neither the foresight of a Waterfield to marry a woman employed by a thriving corporation nor the self-discipline of a Tonzetich. Yet he felt that he deserved more; he believed he worked as hard as his friends and was equally intelligent. What could he do?

Paragraph 2
At that moment a courier approached and asked Brunette to sign for a package. Inside the package he found a letter and an audio cassette. The letter explained that the cassette contained information about the financial services provided by Mr Edgar Endwater, who specialized in selling investments to faculty of the University of British Columbia (UBC). His credentials included a BA (commerce), CA (Chartered Accountant), and CFP (Certified Financial Planner). The investments that Mr Endwater sold would then be integrated into the UBC pension plan—a plan that he described in his letter as “conservative with historically low rates of return.” Now, Brunette had recently served as an expert witness at a trial and was looking for a place to invest the $25,000 he had been paid. He called Endwater.


Paragraph 3
One day later, Mr Endwater appeared at his door. Tall, lean, and clad in an impeccable English-tailored suit, he made Brunette feel stumpy and distinctly unfashionable in his casual attire characteristic of academics. But it turned out they had much in common. Both lived in the fashionable Point Grey area and coached in the soccer club. Mr Endwater enthused about how much he liked working with the children and their parents. Brunette could only admire such a selfless soul; he had found many of the children to be spoiled brats and their parents pretentious yuppies. He felt that Endwater must be quite a good fellow not to mind some of the annoyances associated with coaching Point Grey children.

Paragraph 4
Mr Endwater outlined his qualifications and explained why he had decided to specialize in working with UBC professors. “The amount you can save from your salary is, as you know, quite small. What I’m really interested in is being your agent when you retire and you’re ready to invest your UBC pension fund. I don’t really make any money from these small investments,” he sniffed. (Brunette had mentioned that he wanted to invest $20,000, and, in response to an earlier request, had provided Endwater with a copy of his UBC pension statement.) “But I want to do a good job on the small investments so that you’ll have the confidence to make me your financial planner when you retire.”

Paragraph 5
When Brunette questioned why UBC professors, who would be expected to be intelligent people, didn’t handle their own investments, Mr Endwater replied “That’s a very astute question. UBC professors are indeed very intelligent people—no doubt much smarter than me—but they generally don’t have the time, the interest, or the background to do all the work necessary to evaluate investments.” He paused and looked around Brunette’s office. “Judging by all the awards you’ve received, you’ve worked very hard; you couldn’t have accomplished what you have without working hard. Particularly,” he added with a smile, “when you’re busy participating in sports with your children. But I don’t want to take too much of your time, so let’s get down to business.” He sat in the seat opposite Brunette.

Paragraph 6
“You mentioned that you’re interested in equity investments in US dollars—a very wise move in my opinion; it’s common knowledge that the US dollar is more stable than the Canadian. Moreover, as you know, RRSP (Registered Retirement Savings Plan) accounts are limited by Canadian law in the number of US stocks they can hold so they don’t appreciate as quickly because the Canadian stocks don’t increase as rapidly as the US ones. Many financial planners believe Canadians should invest in US stocks. I recommend the McTavish St Bernard Action Research Plan.” He underlined the S–t–A–R. “This plan will make you an investment ‘StAR.’”

Paragraph 7
The StAR plan was a collection of funds from the McTavish company. With a flourish, Mr Endwater handed Brunette a stack of brochures for each of the funds. The brochures featured colorful graphs and detailed the names of all the stocks purchased by the funds. Mr Endwater continued, “The sophisticated StAR plan involves reallocating assets among the plans on a strategic basis. In this way the volatility of the funds is reduced, hence lowering your risk of losing your investment. The StAR plan is designed by a professor of finance from McGill. His research has appeared in a high-impact financial journal.” He rose from his chair and walked a few steps to examine a painting on the wall. “Of course, having the advantage of this StAR system costs a bit more because more administration is required, but the additional 0.5% added to the standard management expense ratio of 3% is well worthwhile, in my opinion, particularly when you can expect such high rates of return. Twenty percent is possible.”

Paragraph 8
“I’ve heard that the Templeton funds are very good,” Brunette said. “How does the StAR plan compare to those?” “Oh, it’s better,” said Mr Endwater. “Last year StAR made more money than Templeton Growth. Moreover, I suggest that you move immediately. The market is poised to move up. Better for you to get that October growth spurt than to buy in later and pay a premium. Just let me know what you decide, and I can send a courier to pick up your check.” That night, during his nightly walk with his dog Brutus, Brunette visited Mr Endwater’s house and delivered the check. Mr Endwater’s house looked rather more expensive than his own, Brunette noticed.

Conclusion
What happened to St Bernard Action Research Plan? It did not do well. Brunette took to calling it “Dog-of-My-Heart Mutual Fund.” Only later did Brunette learn that investing guru Warren Buffett had predicted that, over time, the most likely return on equities would be around 6%. That, combined with high management expense ratios (MERs) of 3.5%, meant that the fund managers were benefitting more from the investors’ wealth than the investors themselves. By receiving “trailer commissions,” agents received a chunk of the MER for themselves for as long as the investor held the fund. Mr Endwater thus did well on Brunette’s investment. Brunette himself did not do so well; although he didn’t actually lose money on StAR, neither did he make as much money as he could have through other strategies. Moreover, he got out in time. When the stock market declined sharply in 2000, many funds, including StAR, were folded. Today, when McTavish distributes the performance of its funds, StAR isn’t listed because it no longer exists. Like a bad doctor, McTavish has buried its mistake.

Problem 39

A modern Cicero

I hopped up onto the #14 bus that runs between UBC and downtown. I was going to a Whitecaps soccer game and wearing a Whitecaps sweater; I figured I looked like a sporty middle-aged male. The bus was crowded with UBC students. After I got on the bus, the very large bus driver boomed out: “Listen, people. I shouldn’t have to tell you this—when you see somebody get on the bus who obviously needs a seat, get up and give it to him.” The bus driver must obviously have studied the Roman orator Cicero, who taught that in rhetoric one should charm, teach, and move. In any case, the effect on the students was electric; it was as if he had applied a cattle prod to their posteriors. I now had a choice of seats and the students were hanging from straps or reaching for slippery poles for support, for it had become socially reprehensible and thus morally impossible for them to take a seat. A couple of other old fossils emerged from the mass of standing students and smilingly ensconced themselves on unoccupied seats. My ego and sporty self-image were damaged, but my posterior and old bones were comfortably accommodated for the trip downtown.

Analyze the bus driver’s rhetorical technique.

Comments on Problems

These comments should be viewed as neither the definitive nor indeed the only answer to the questions posed in the problem set. Other views are possible, and I suspect that in some instances the authors would be prepared to argue with these comments.

Comment on problem 1

This question illustrates the often complex nature of treatments as applied. Considered as a treatment, the results of this positive-effects paper indicate that fluoride treatment promotes bone formation. However, it is difficult to ascribe the effect solely to the fluoride ion because the treatment is complex, for an acidulated (0.1 M H3PO4) solution is used. An alternative hypothesis is that the acid, rather than the fluoride treatment, promotes bone formation. This concern could be alleviated by introducing another control in which only acid is applied.

Comment on problem 2

The usual reason for using specific biologic response modifiers such as PTE is to look for specific effects. In this example there was an effect on rupture strength of intact skin but also weight loss in the PTE-treated group. It thus appears that there were some more general effects on the animals’ metabolism and the authors rightly conclude that the explanation for the increase in strength of skin produced by PTE is not understood.

Comment on problem 3

This is a clear example of argument to authority. An aspect of this excerpt that I find charming is that it includes a compliment to the cited authorities, quite different from the common style today where such compliments are absent and grudging citations common.

Comment on problem 4

The phrase “if all other principles of careful surgical technique are observed” exemplifies the technique of using an auxiliary hypothesis. If anyone failed to repeat the findings on polylactate, the authors could question their surgical technique.

Comment on problem 5

This study exemplifies analogy as a method of designing studies. Every parent—or at least every parent that I know—at some point descends to the level of striking deals with their children as opposed to maintaining the high moral ground of insisting their children do the right thing on principle alone. This paper argues that a similar deal-making (or contingency-contracting) approach could be used by oral health professionals. The outcome is not guaranteed because, as in all analogies, similarity is not identity. Health care professionals share with parents a need to modify children’s behavior sometimes, but the relationship between the child and the parent is different from that between the child and the health care worker. The analogy, however, provides perfectly reasonable grounds for doing the study.

Comment on problem 6 This example involves analogy but also shows the use of the transfer method in research design. In brief, it builds on clinical traditions that gagging can sometimes be controlled by breathing techniques and transfers the techniques developed for women in labor to gaggers. The advantages of such an approach are many. For example, the authors do not have to undergo all of the problems that were involved in developing the method. Moreover, because they are dealing with an established method, readers will probably be less skeptical than if the authors introduced a new method for this study. The negative result was a finding of no difference in the neurotic dimension between the retching and control patients. This finding enables the authors to interpret their results more specifically. They do not have to consider the neuroses of the subjects as influencing the results.

Comment on problem 7 This is a negative-results paper. The question of interest would be: What size effect could the authors have seen using their methods?

Comment on problem 8 Two types of questions emerge. First, the lactitol “treatment” consists of two treatments (ie, lactitol and an artificial sweetener, acesulfam-K), so that, in theory at least, either could be responsible for effects. Experts in this topic could judge whether the acesulfam-K was a plausible, as opposed to a merely possible, explanation. Second, another possible problem is whether the unfavorable gastric effects of the lactitol caused the subjects to alter other aspects of their diet in ways that might affect plaque.

Comment on problem 9 I construct the argument in formal terms as a valid pure hypothetical syllogism as follows:
If splint, then occlusal interferences (OIs) decrease (a suppressed premise that would appear to be true in many instances).
If OIs decrease, then afferent impulses (AIs) decrease.
If AIs decrease, then masseter muscle activity decreases.
Therefore: if splint, then masseter muscle activity decreases.

Comment on problem 10 I construct the argument in formal terms as follows: If dental plaque were the only etiologic factor . . . then all people . . . with plaque have these diseases (ie, if P then Q). It is not the case that all people with plaque have the disease (ie, not Q). This step involves a suppressed premise, assuming through the sentence “Primitive humans . . .” that plaque is correlated negatively with the presence of devices for removing plaque. (This may not be the case; a number of investigators have documented devices used by so-called primitive peoples, including such things as twigs that contain antimicrobial agents that might inhibit plaque.) Therefore: Plaque is not the only etiologic factor (ie, not P). The conclusion, which is not stated explicitly in the excerpt, follows validly (modus tollens). Because of the uncertainty of the suppressed premise, however, the argument may not be sound.

Comment on problem 11 Although I believe the first part of Iversen’s analysis given in chapter 5 to be correct, I do not agree with the second part. Translating his syllogism into standard form, I get:
All chalones are tumor regressors (A proposition).
Some tumor regressors are not cancer cures (O proposition).
Chalones are not cancer cures.
This syllogism is invalid because the middle term (tumor regressors) is not distributed in either premise. Readers should recall that an invalid conclusion does not necessarily mean a wrong conclusion; invalidity merely means that the truth of the conclusion is not guaranteed by the truth of the premises.

Comment on problem 12 This is an example of a suppressed premise. To justify saving the tooth via periodontal surgery, some additional assumptions must be added about the 32 teeth being present. One assumption that would satisfy this need would be “it is necessary to maintain the complete dental arch.” Another might be “and since the patient has demonstrated sufficient oral care to keep the other 31 teeth in good shape, chances are he or she will be able to look after this tooth properly after surgery.” But some may disagree with either of these premises, so it is clear that additional explanation would be helpful.

Comment on problem 13 This is an example of ordinal scale data. Statistical purists would not like the authors to average such data and would prefer that the statistical testing use nonparametric methods such as the Mann–Whitney test; a minimal sketch of such a comparison follows.
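A minimal sketch of the preferred nonparametric comparison, assuming SciPy is available; the scores below are hypothetical ordinal values, not data from the problem:

```python
from scipy.stats import mannwhitneyu

# Hypothetical ordinal scores (eg, a 0-4 clinical index) for two groups;
# with ordinal data, the ranks, not the means, carry the information.
control = [2, 3, 1, 4, 2, 3, 2]
treated = [1, 2, 1, 2, 0, 1, 2]

stat, p = mannwhitneyu(control, treated, alternative="two-sided")
print(stat, p)   # the U statistic and its two-sided P value
```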

Comment on problem 14 This is an example of error due to an assignable cause (the different technician). It is a comparatively rare event to see such an error acknowledged so explicitly. But it is certainly better for the authors to state, as they did, how the data are affected by eliminating the suspect points than it is simply not to report the suspect data at all—probably the more common approach.

Comment on problem 15 This classification fulfills the first step in Wilson’s criteria for a full-fledged operational definition (that of an intuitive feeling for the quantity). However, given the uncertainties in estimating the survival time of individual teeth, it seems it would be difficult to get agreement between observers. Moreover, verifying the classifications would be difficult because, in all probability, ethics would dictate that treatment be done now, and validation of the classifications would require observing the teeth with no treatment for a considerable time.

Comment on problem 16 This is an instance of observational reactivity. The anticorrosive treatment interfered with the SEM observations.

Comment on problem 17 I think this is a good operational definition that specifies precisely how the measurement was done. Length (in this example, mm) is part of an international system of standards.

Comment on problem 18 The problem that occurred here was the “ceiling effect.” After groups G and K got to the ceiling value of 4, they had nowhere higher to go, and other groups started to catch up to them, diminishing the difference between groups.


Comment on problem 19 This is an instance where ethical considerations eliminate a preferred experiment design; this example illustrates that real-world experimentation often necessarily involves compromises.

Comment on problem 20 In this case-control study, the outcome variable is the plaque microflora. A possible concern is that a number of other factors, such as diet, are not controlled and would be expected to vary systematically between the groups. As in all case-control designs, such differences are potential alternative explanations. Note that the authors argue that such differences are not important because the subgingival flora is similar, but there is still the possibility of a specific interaction between type of plaque and one of the uncontrolled factors.

Comment on problem 21 The power of the crossover design is illustrated in this study in which a small effect could be demonstrated in a relatively small sample of 20. It appears that the authors were taking means of ordinal scale data and analyzing the data using parametric statistics, a procedure that would not necessarily garner the approval of most statisticians.

Comment on problem 22 This is an example of a cardinal rule of experimentation, namely, to make sure that things are set up so that you will see something. In this instance, the authors ensured that tooth staining would occur by the simple expedient of having the subjects drink eight cups of tea a day. If tea drinking had been left to chance, there would have been much more variability, and effects would have been more difficult to demonstrate.

Comment on problem 23 This is an example of the problems of interpreting causation from correlational data. If simply being rich leads to higher SAT scores and thus entrance into desired programs, then there is a problem, since most people would agree that entrance into programs should be based on merit. There may be a hidden variable in that the wealthier homes may be wealthier because the parents are, in fact, smarter, and these smarter parents produce smarter children. Seligman seems to advocate the latter view in his book, A Question of Intelligence. Others would dispute the validity of the SAT tests or the accuracy of the reports of parents’ income. In any case, it is clear that such data cannot be interpreted simplistically.

Comment on problem 24 This case series demonstrates two features often associated with clinical presentations:
1. A complex treatment in which a clinician conscientiously tries to combine the best of several approaches in an attempt to produce the best outcome. The difficulty, from the point of view of determining the causes of any successes that occur, is that there are multiple treatment variables, and one does not know which parts of the treatment are necessary. In this example, the author would have to have a control group in which the roots were not treated with citric acid to demonstrate the necessity of citric acid conditioning.
2. An emphasis on concrete information (clinical photographs) and a lack of statistical analysis. The vividness of the clinical photos tends to convince readers of the efficacy of the treatment even though the case series itself is fairly small (five patients).

Comment on problem 25 This is another example of an investigator taking no chances and setting up conditions so that something will happen. Who knows how long the investigator would have to wait for those rats to develop caries if he hadn’t given them 10⁸ CFU/mL Streptococcus mutans?

Comment on problem 26 Although not treated as such, this experiment appears to employ a factorial design with two factors (ie, resin and mode of application), each at two levels (ie, resins 1 and 2, direct and indirect methods of application). If the experiment were analyzed as a factorial design, the authors could have tested for an interaction between resin and mode of application, and indeed inspection of the data indicates that such an interaction was present. A positive aspect of their analysis was the use of the Bonferroni correction for multiple comparisons.

Comment on problem 27 The blocking procedure used here should enable the investigators to determine the effects of treatment on specific groups in terms of age, sex, and root caries score. Note that the cost of this inferential power is that they had to obtain 810 volunteers to get sufficient sample size in the subgroups.

Comment on problem 28 This is a clear instance of sample mortality—the sample shrank to 89 from 659. The problem is that one cannot be sure that the 89 remaining subjects are randomly selected and thus representative of the original 659. Another possible source of invalidity is history, because the cohort marched along 14 years together in a particular period of time; in a different 14-year period, it is possible that different results would be obtained. Sometimes there is nothing investigators can do to deal with such problems. They cannot, for example, force subjects to be examined. In such instances, the most common approach is to discuss the problem and make allowance for it in the interpretation of results.

Comment on problem 29 It looks like Coach Ley is a victim of statistical regression. He thinks his praise results in decreased performance, but in all likelihood, after an exceptionally good game, his team merely regresses to its mean performance level. In the instance of the Hartford Whalers, that level was not very high, and Coach Ley had to seek employment elsewhere.

Comment on problem 30 One way of interpreting the data would be to use a contingency table (Table CP-1). Using the formula on page 142, we can calculate the expected values for each of the cells and the χ² value:

χ² = 8.22, df = 2, P = .016

Thus pain is significantly associated with type of treatment. A difficulty associated with the interpretation might be what is called confounding by indication. In other words, the choice of treatment might be influenced by the signs and symptoms of the patient, so it may be the preexisting condition of the patient that is causing the association of pain and treatment.

Table CP-1 | Expected values

          Pulpotomy   Partial pulpotomy   Complete pulpectomy
Pain      36.3        31.8                19.9
No pain   358         314                 196

Comment on problem 31 This is an example of a treatment that produces only a small effect, so to have any hope of seeing a statistically significant difference, the investigator has to study a large number of subjects. Using the formula for effect size on page 308:

d_s = (X̄_a − X̄_b) / s_p

where the pooled standard deviation is

s_p = √[((n_a − 1)s_a² + (n_b − 1)s_b²) / (n_a + n_b − 2)]

s_p = √[(86(4.66)² + 86(5.41)²) / 172] = 5.04

d = (5.06 − 4.78) / 5.04 = 0.055

It is indeed a very small effect size, and it needs a large sample size to be demonstrated.
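A minimal sketch of these two computations, assuming NumPy is available. The marginal totals below are back-calculated from Table CP-1 (the excerpt does not reproduce the observed cell counts, which the full χ² computation would require):

```python
import numpy as np

# Marginal totals implied by Table CP-1 (row and column sums of the expected
# values); an assumption, since the observed cell counts are not given here.
row_totals = np.array([88, 868])          # pain, no pain
col_totals = np.array([394, 346, 216])    # pulpotomy, partial pulpotomy, complete pulpectomy

expected = np.outer(row_totals, col_totals) / row_totals.sum()
print(expected.round(1))                  # first row ~ [36.3, 31.8, 19.9], as in Table CP-1

# Pooled standard deviation and effect size from problem 31
sp = np.sqrt((86 * 4.66**2 + 86 * 5.41**2) / 172)
d = (5.06 - 4.78) / sp
print(round(sp, 3), round(d, 3))          # ~5.049 (the text rounds to 5.04) and 0.055
```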

Comment on problem 32 The use of multiple t tests is inappropriate, because it increases the risk of type I error. Analysis of variance (ANOVA) followed by a multiple comparison test, such as Tukey’s HSD (honestly significant difference) test, would be the preferred approach. An unusual feature of the paper was that less than 17% of the data collected in the paper was subjected to statistical analysis. The reader must wonder whether other analyses were done and the comparisons showed no significant differences.
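To make the preferred approach concrete, a minimal sketch assuming SciPy and statsmodels are available; the three groups are hypothetical illustrative data, not measurements from the paper:

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical data for three groups
g1 = [12.1, 13.4, 11.8, 12.9, 13.1]
g2 = [14.2, 15.1, 13.9, 14.8, 15.0]
g3 = [12.5, 12.9, 13.2, 12.2, 13.0]

print(f_oneway(g1, g2, g3))   # one overall F test instead of multiple t tests

values = np.concatenate([g1, g2, g3])
groups = ["g1"] * 5 + ["g2"] * 5 + ["g3"] * 5
print(pairwise_tukeyhsd(values, groups))   # Tukey HSD for the pairwise comparisons
```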


Comment on problem 33
a) Changing the criteria would be an example of what Campbell and Stanley called “instrumentation,” ie, your measuring instrument is changing.
b) This is a negative result, ie, one in which the null hypothesis is accepted. Indeed, it is a negative result that the authors would have been happy to obtain, because it makes interpretation of their study more straightforward. As for all negative results, one worries about how small a difference could have been detected. The authors, however, alleviate that concern by conducting a comparison involving quite a large number of children (110). Thus, it seems that the authors looked very hard for a possible difference and did not find it.
c) Eighteen years is a relatively long time, and one would suspect that history in various forms could be operative. For example, was there a change during that time in the socioeconomic status of the residents being examined?

Comment on problem 34 The presentation of data has violated Huth’s rule of 50; that is, percentages should be reported only when the fraction used in calculating them has a denominator (in this case, the number of teeth treated) greater than 50. The basis for the rule is that one wants to avoid pseudo-precision such as that implied by the 1/8 = 12.5% reported in Table P-6 for group 2. When the denominator of the fraction is less than 50, the confidence interval (CI) for the reported percentage will be very large. Using the table given in appendix 2, we can calculate that the 95% CI for the rate of unsuccessful procedures (reported as 27.2% in group 1) ranges from 2.8% to 60%. Similarly, for group 2 (reported as 12.5%), the CI ranges from 0.32% to 53%.
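The width of such intervals can be checked directly. A minimal sketch, assuming SciPy is available; clopper_pearson is our label for the exact binomial limits that the appendix 2 table provides:

```python
from scipy.stats import beta

def clopper_pearson(x, n, conf=0.95):
    """Exact (Clopper-Pearson) binomial confidence limits, in percent."""
    a = 1 - conf
    lower = beta.ppf(a / 2, x, n - x + 1) if x > 0 else 0.0
    upper = beta.ppf(1 - a / 2, x + 1, n - x) if x < n else 1.0
    return 100 * lower, 100 * upper

print(clopper_pearson(1, 8))   # ~ (0.3, 52.7): group 2's 12.5% could lie almost anywhere in this zone
```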

Comment on problem 35 The main threat to the validity of this study is mortality. In brief, we wonder whether the 62% who were studied differed in significant ways from those who dropped out. For example, if loss of employment is a prime reason for dropping out of the plan, it seems plausible that patients who dropped out of the plan may differ in socioeconomic status from those who remain; that factor could be linked to treatment outcomes through some intervening variable, such as diet or oral hygiene practices. Thus, there is the possibility that treatment outcomes observed in those who stayed in the plan cannot be reliably extrapolated to the target population (ie, all patients enrolled in a plan). The authors attempted to deal with this problem by (1) comparing the demographic and dental characteristics of those who stayed with those who dropped out of the plan, and (2) comparing the outcomes of the dropouts with the outcomes of the stay-ins to the extent that it was possible with the limited data available for the dropouts.

Comment on problem 36
a) The evaluation of root surface smoothness is a necessary component of practice as well as of instruction in clinical periodontics. Thousands of evaluations of this type must be done every day. However, because the methods used in this study are less than ideal, a vote must be taken among the evaluators to measure the property. Thus, it appears that the method used is not much better than the lowest rank of an “intuitive feeling for the quantity.” From the latter part of the sentence, it appears that the authors are working on a method involving microscopy that would take them to the next level, that is, “a method of comparison so that it can be said that A has more of the property than B.” Given the importance of this measurement, one imagines that it might become a fertile research area in which the types of instruments used in surface science (such as profilometry) might be adapted to give standardized readings.
b) The principle adopted here seems to be sound: controlling those factors that you can, and measuring the rest. The principal aim in this study was to determine the ability of clinicians to detect residual calculus, and because the investigators wanted to include surfaces that had been evaluated as satisfactory, they could not necessarily limit the time spent on preparing the surface. But by measuring the time spent using ultrasonics and hand scaling, they gave themselves the opportunity to gain other insights into the problem—for example, the frequency of satisfactory surfaces as a function of time of hand scaling.

Comment on problem 37 Considered as a deductive argument, this statement makes a number of assumptions:
1. The studies that reported “no significant differences” were powerful enough to detect clinically meaningful effects.
2. The designs that were investigated were the only ones possible or at least likely to improve plaque removal.
3. The only important considerations are plaque removal and safety.
A formal analysis of the logic might be cast along the lines of a disjunctive syllogism as follows:
Either new designs or efficacy should be studied (suppressed disjunctive premise).
New designs should NOT be studied (as they do not differ in efficacy).
Therefore, safety and efficacy should be studied.
The logic could be criticized as an example of the UFO fallacy; surely there are other reasons besides the ones stated for studying new designs, including appearance, price, and composition, among others. To be fair, the authors hedge their conclusion by including the word “perhaps,” so maybe they were not attempting a definitive conclusion.

Comment on problem 38 A list of the strategies used in each paragraph follows.

Paragraph 1 Cognitive dissonance. Brunette experiences discomfort on realizing that his friends, who overtly are no smarter or harder working than he, are nonetheless wealthier. This makes him vulnerable to persuasion from a credible person who offers a simple solution (in this case, a mutual fund). Note that it is Brunette who creates this discomforting state for himself; not uncommonly, the most effective salesman is the customer. Cognitive dissonance also occurs when scientists become dissatisfied with contradictions or gaps in our knowledge and perform experiments to try to resolve them. But sometimes scientists become so enamored of their theories that they fail to undertake critical experiments, or even adjust their data, to avoid contradiction with their favored hypothesis and the cognitive dissonance that it entails. Cognitive dissonance, then, can be used for good or evil.

Paragraph 2 Endwater’s degrees serve as obvious appeals to authority.

Paragraph 3 The fact that Endwater wears a suit, silly as it may seem, does add authority to his pronouncements. (I once knew a senior academic administrator who, I strongly suspected, must have worn his suit to bed. He was a terrible administrator: the numbers in his budget reports never added up; bureaucratic staff multiplied without reason; and, although he was allegedly an academic, no one could really figure out his record in teaching or research—except that he never seemed to have done much of either. But he did look good in a suit, and that seemed to form the basis of his authority.) Clothing as a basis for authority is not restricted to academe. The Fortune columnist Stanley Bing has commented that some star CEOs seem to wear shirts that are whiter and crisper than those of the rest of us. According to Bing, CEOs also have better hair. The moral of this paragraph is to beware of men and women of fine suits but limited substance. A second rhetorical device is demonstrated in this paragraph: Endwater seems to be a friendly guy to whom Brunette takes a liking. Cialdini tells us that liking is one of the tools of effective persuasion.

Paragraph 4 Endwater’s methods of establishing credibility and opinion movement are subtle. Endwater (as a salesman) clearly has an agenda: sales form the basis of his salary. This self-interest reduces the credibility of his advice somewhat. Under conditions of limited credibility, the best strategy for a persuader is to attempt only moderate movement—in this case, get the $20,000 sale and move to the large sale later.

Paragraph 5 This is unabashed flattery, pure and simple. People like flattery; they even like flattery that they know to be inaccurate. Flattery leads to liking and, Cialdini tells us, liking leads to persuasion.

Paragraph 6 Social validation. Brunette receives some social validation of his thoughts. Phrases such as “common knowledge” and “many planners believe” serve as social validation of Brunette’s plans and make it easier for Endwater to sell products that conform to those plans. The use of “StAR,” an easily memorable acronym, is an example of the value of vivid, concrete examples.


More important, Endwater controls the agenda by introducing the McTavish product and thus directing Brunette’s thoughts along a specific line (law of cognitive response). Out of a universe of mutual funds, Brunette’s thoughts are focused on one.

Paragraph 7 Authority is used again in the allusion to the McGill professor. Argument from authority is strictly valid only if all reasonable, well-qualified experts agree. It is highly unlikely that all professors of commerce would agree with their McGill colleague. Contrast (another of Cialdini’s tools of persuasion) can be found in the difference between the possible 20% return and the mere 0.5% additional cost of opting for StAR. The problem, of course, is that the 0.5% cost is a certainty whereas the 20% is only a possibility, and an unlikely one at that. But by placing 0.5% in context with 20%, the 0.5% appears small.

Paragraph 8 Brunette makes an abortive attempt to consider alternative proposals. If one considers the hypothesis that StAR is the best investment, and given that the supporting evidence comprises brochures from just one firm, it is certainly a good logical approach to criticize the proposal by considering alternatives. However, Brunette’s knowledge is limited and his approach tentative. It is an easy matter for Endwater, with his superior financial credentials, to brush it aside. In fact, history shows that Templeton Growth was a superior fund and that the comparison selected by Endwater (looking to last year) was anomalous. This exchange shows that an expert has a significant advantage when discussing alternatives.

Survivorship bias. The average return of the McTavish funds that survive is better than the average of all their funds, because the number calculated for the survivors does not include the numbers generated by the dogs, like StAR, that have been sacrificed to the gods of public relations. Survivorship bias is always an important factor in evaluating experiments or clinical trials where the dropout rate of participants is high.

In addition, scarcity (another Cialdini tool of persuasion) can be a strong motivator. Here, the scarce element is time. In science, scarcity appears as a tool of persuasion when an author who is attempting to persuade reviewers of the high quality of his work emphasizes publications in general-interest journals (such as Science or Nature). These journals have limited space, and any article that appears in them is thus generally associated with high quality.

Comment on problem 39 Charm. As a classically trained orator, Cicero knew that ethos was one of the best techniques for establishing what we would now call communicator credibility. The bus driver presented himself as a dutiful person who did not shirk unpleasant tasks.

Teach. The bus driver used the Aristotelian approach of example to teach the wretched students what “an obviously seat-needing person” looked like. Such a person need not be in a wheelchair or using a walker or a cane. A person needing a seat just had to be decrepit. I was the example. Although cunningly disguised in sporty attire, I could easily be discerned by knowledgeable observers to be a man broken down by long years of study, followed by more years of banging his head’s contents against the stony skulls of unappreciative students while being paid a pittance. The common occurrence of retired professors on campus should have ensured that UBC students would have no difficulty recognizing such decrepitude when they saw it. The bus driver in all probability knew the character of his audience, UBC students, who generally are polite and courteous people who would be likely to “do the right thing.”

Move. A bus driver, by virtue of training and responsibility for the safe operation of the bus, could be considered a reliable authority on proper passenger behavior, and he was appropriately exerting what is called authority pressure.¹ A second operative factor was what is called social proof.² In brief, this much-studied phenomenon commonly occurs when people are unsure of themselves or when the situation is unclear or ambiguous. In such circumstances, people become more likely to accept the actions of others as correct. So once the bus driver had moved some students through authority pressure, the mechanism of social proof accelerated the process, and the problem of efficient use of bus seating became not unseated fossils but rather empty seats.

References
1. Cialdini RB. Authority: Directed deference. In: Influence: Science and Practice. Needham, MA: Allyn & Bacon; 2001:178–202.
2. Cialdini RB. Social proofs: Truths are us. In: Influence: Science and Practice. Needham, MA: Allyn & Bacon; 2001:98–142.


Appendices


Appendix 1 Library of Congress Classification System

Main Classifications
A. General Works
B. Philosophy; Religion
C. History
D. World History
G. Geography; Anthropology
H. Social Sciences
I. (Vacant)
J. Political Science
K. Law
P. Language

Expanded Classifications (subject areas of interest in oral health research)

Q. Science
QH Biology
QR Microbiology
QS Human Anatomy
QT Physiology
QU Biochemistry
QV Pharmacology
QW Bacteriology and Immunology
QX Parasitology
QY Clinical Pathology
QZ Pathology

R. Medicine
RA Public Aspects of Medicine
RB Pathology
RC Internal Medicine
RD Surgery
RE Ophthalmology
RF Otorhinolaryngology
RG Gynecology and Obstetrics
RJ Pediatrics
RK DENTISTRY
RL Dermatology
RM Therapeutics; Pharmacology
RS Pharmacy and Materia Medica
RT Nursing

W. General and Miscellaneous Material Relating to the Medical Profession
WA Public Health
WB Practice of Medicine
WC Infectious Diseases
WD Deficiency Diseases
WE Musculoskeletal System
WF Respiratory System
WG Cardiovascular System
WH Hemic and Lymphatic System
WI Gastrointestinal System
WJ Urogenital System
WK Endocrine System
WL Nervous System
WM Psychiatry
WN Radiology
WO Surgery
WP Gynecology
WQ Obstetrics
WR Dermatology
WS Pediatrics
WT Geriatrics
WU DENTISTRY, ORAL SURGERY
WV Otorhinolaryngology
WW Ophthalmology
WX Hospitals
WY Nursing
WZ History of Medicine


Appendix 2 Limits of Binomial Population Percentages of Xs Estimated from Random Samples—No. of Xs in Sample: 0–50; Sample Sizes: 1–100*

This table of binomial population limits can be used to indicate the confidence interval, that is, the zone within which we would accept any population as being a possible parent of the sample. Thus, if a treatment yielded three failures in the six patients treated, we would look up the values grouped to the right of X = 3 (the number of failures) and find the number 6 (the number in the sample). This entry reads “12; 88”; that is, the failure rate could lie anywhere between 12% and 88%. To save space, the table does not show values above 50%; these must be calculated by taking the opposite class and then subtracting from 100%. For example, suppose there were four failures out of six patients. We would first calculate the confidence interval using the opposite class, ie, successes, which equal two out of six. Referring to the table at X = 2 and moving across to 6, we find “4.3; 78.” Then, subtracting from 100, we find that for failures the lower limit is 100 − 78 (the upper limit for the percentage of successes) = 22%, and the upper limit is 100 − 4.3 = 95.7%.

Maximum risks of overestimating the lower limit and of underestimating the upper limit = 2.5%. X = no. of Xs in sample. Boldfaced figures are sample sizes. Lower and upper limits are separated by semicolons. For explanation and method of use, see chapters 11 and 12.
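Both the direct lookup and the opposite-class trick can be reproduced computationally. A minimal sketch, assuming SciPy is available; limits is our hypothetical helper, not part of the original table:

```python
from scipy.stats import beta

def limits(x, n, conf=0.95):
    """Exact binomial confidence limits, in percent, as tabulated here."""
    a = 1 - conf
    lo = beta.ppf(a / 2, x, n - x + 1) if x > 0 else 0.0
    hi = beta.ppf(1 - a / 2, x + 1, n - x) if x < n else 1.0
    return 100 * lo, 100 * hi

print(limits(3, 6))         # ~ (11.8, 88.2): the "12; 88" entry for X = 3, sample size 6
lo, hi = limits(2, 6)       # opposite class: two successes out of six
print(100 - hi, 100 - lo)   # ~ (22.3, 95.7): limits for four failures out of six
```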

X = 0
1 0; 97.5 6 0; 46 12 0; 26 25 0; 14 60 0; 6.0
2 0; 84 7 0; 41 14 0; 23 30 0; 12 70 0; 5.1
3 0; 71 8 0; 37 16 0; 21 35 0; 10 80 0; 4.5
4 0; 60 9 0; 34 18 0; 19 40 0; 8.8 90 0; 4.0
5 0; 52 10 0; 31 20 0; 17 50 0; 7.1 100 0; 3.62

X = 1
2 1.3; 98.74 3 0.84; 91 7 0.36; 58 8 0.32; 53 14 0.18; 34 16 0.16; 30 25 0.10; 20 30 0.08; 17 60 0.04; 9.0 70 0.04; 7.7
4 0.63; 81 9 0.28; 48 18 0.14; 27 35 0.07; 15 80 0.03; 6.8
5 0.51; 72 10 0.25; 44.5 20 0.13; 25 40 0.06; 13 90 0.03; 6.0
6 0.42; 64 12 0.21; 38.5 22 0.12; 23 50 0.05; 11 100 0.025; 5.45

X = 2 4 6.8; 93 9 2.8; 60 16 1.6; 38 26 0.95; 25 50 0.49; 14 100 0.24; 7.04

5 5.3; 85 10 2.5; 56 18 1.4; 35 28 0.88; 24 60 0.41; 12

6 4.3; 78 11 2.3; 52 20 1.2; 32 30 0.82; 22 70 0.35; 10

7 3.7; 71 12 2.1; 48 22 1.1; 29 35 0.70; 19 80 0.30; 9

8 3.2; 65 14 1.8; 43 24 1.0; 27 40 0.61; 17 90 0.27; 8

X = 3

6 12; 88 12 5.5; 57 22 2.9; 35 35 1.8; 23 70 0.89; 12

7 9.9; 82 14 4.7; 51 24 2.7; 32 40 1.6; 20 80 0.78; 11

8 8.5; 75.5 16 4.0; 46 26 2.4; 30 45 1.4; 18 90 0.69; 9

9 7.5; 70 18 3.6; 41 28 2.3; 28 50 1.3; 17 100 0.62; 8.53

10 6.7; 65 20 3.2; 38 30 2.1; 27 60 1.0; 14

X = 4

8 16; 84 14 8.4; 58 24 4.8; 37 35 3.2; 27 70 1.6; 14

9 14; 79 16 7.3; 52 26 4.4; 35 40 2.8; 24 80 1.4; 12

10 12; 74 18 6.4; 48 28 4.0; 33 45 2.5; 21 90 1.2; 11

11 11; 69 20 5.8; 44 30 3.8; 31 50 2.2; 19 100 1.10; 9.93

12 9.9; 65 22 5.2; 40 32 3.5; 29 60 1.9; 16

X = 5

10 19; 81 11 17;77 18 9.7; 53 20 8.7; 49 28 6.1; 37 30 5.6; 35 45 3.7; 24 50 3.3; 22 90 1.8; 12.5 100 1.64; 11.29

12 15;72 22 7.8; 45 32 5.3; 33 60 2.8; 18

14 13;65 24 7.1; 42 35 4.8; 30 70 2.4; 16

16 11; 59 26 6.6; 39 40 4.2; 27 80 2.1; 14

*Used with permission from Mainland D. Elementary Medical Statistics. Philadelphia: WB Saunders; 1952:359–360.


Appendix 2 (continued)

X = 6
12 21; 79 18 13; 59 28 8.3; 41 40 5.7; 30 80 2.8; 16
13 19; 75 20 12; 54 30 7.7; 39 45 5.1; 27 90 2.5; 14
14 18; 71 22 11; 50 32 7.2; 36 50 4.5; 24 100 2.24; 12.60
15 16; 68 24 9.8; 47 35 6.6; 34 60 3.8; 20.5
16 15; 65 26 9.0; 44 37 6.2; 32 70 3.2; 18

X = 7
14 23; 77 22 14; 55 32 9.3; 40 45 6.5; 29 70 4.1; 20
15 21; 73 24 13; 51 34 8.7; 38 50 5.8; 27 80 3.6; 17
16 20; 70 26 12; 48 37 8.0; 35 55 5.3; 24 90 3.2; 15
18 17; 64 28 11; 45 40 7.3; 33 60 4.8; 23 100 2.86; 13.90
20 15; 59 30 9.9; 42 42 7.0; 31 65 4.4; 21

X = 8
16 25; 75 22 17; 59 32 11; 43 45 8.0; 32 70 5.1; 21
17 23; 72 24 16; 55 35 10; 40 50 7.2; 29 80 4.4; 19
18 22; 69 26 14; 52 37 9.8; 38 55 6.5; 27 90 3.9; 17
19 20; 66.5 28 13; 49 40 9.0; 36 60 5.9; 25 100 3.51; 15.16
20 19; 64 30 12; 46 42 8.6; 34 65 5.5; 23

X = 9
18 26; 74 26 17; 56 37 12; 41 50 8.6; 31 80 5.3; 20
19 24; 71 28 16; 52 40 11; 38 55 7.8; 29 90 4.7; 18
20 23; 68 30 15; 49 42 10; 37 60 7.1; 27 100 4.20; 16.40
22 21; 64 32 14; 47 45 9.6; 35 65 6.5; 25
24 19; 59 34 13; 44 47 9.2; 33 70 6.1; 23

X = 10
20 27; 73 28 19; 56 40 13; 41 55 9.1; 31 90 5.5; 19
21 26; 70 30 17; 53 42 12; 39 60 8.3; 29 100 4.90; 17.62
22 24; 68 32 16; 50 45 11; 37 65 7.6; 26
24 22; 63 34 15; 47 47 11; 36 70 7.1; 25
26 20; 59 37 14; 44 50 10; 34 80 6.2; 22

X = 11
22 28; 72 32 19; 53 42 14; 42 60 9.5; 30 100 5.62; 18.83
24 26; 67 34 17; 51 45 13; 40 65 8.8; 28
26 23; 63 36 16; 48 47 12; 38 70 8.1; 26
28 21; 59 38 15; 46 50 12; 36 80 7.1; 23
30 20; 56 40 15; 44 55 10; 33 90 6.3; 21

X = 12
24 29; 71 32 21; 56 42 16; 45 60 11; 32 100 6.36; 20.02
25 28; 69 34 20; 53.5 45 15; 42 65 9.9; 30
26 27; 67 36 19; 51 47 14; 40 70 9.2; 28
28 24; 63 38 18; 49 50 13; 38 80 8.0; 25
30 23; 59 40 17; 47 55 12; 35 90 7.1; 22

X = 13
26 30; 70 32 24; 59 42 18; 47 60 12; 34 90 7.9; 23
27 29; 68 34 22; 56 45 16; 44 65 11; 32 100 7.11; 21.20
28 28; 66 36 21; 54 47 16; 43 70 10; 30
29 26; 64 38 20; 51 50 15; 40 75 9.6; 28
30 25; 63 40 19; 49 55 13; 37 80 9.0; 26


X = 14

28 31; 69 38 22; 54 50 16; 42 75 11; 29

30 28; 66 40 21; 52 55 15; 39 80 9.9; 28

32 26; 62 42 20; 50 60 13, 36 90 8.8; 25

X = 15

30 31; 69 40 23; 54 50 18; 45 75 12; 31

32 29; 65 42 22; 52 55 16; 41 80 11; 29

34 27; 62 44 21; 50 60 15; 38 85 10; 27

X = 16

32 32; 68 42 23; 54 55 18; 43 80 12; 30

34 30; 65 44 22; 52 60 16; 40 90 11; 27

36 28; 62 46 21; 50 65 15; 37 100 9.45; 24.66

38 26; 59 48 20; 48 70 14; 34

40 25; 57 50 20; 47 75 13; 32

X = 17

34 32; 68 44 24; 55 60 17; 41 85 12; 30

36 30; 65 46 23; 52 65 16; 39 90 11; 28

38 29; 61 48 22; 51 70 15; 36 100 10.25; 25.79

40 27; 59 50 21; 49 75 14; 34

42 26; 57 55 19;45 80 13; 32

X = 18

36 33; 67 46 25; 55 65 17; 40 90 12; 30

42 28; 59 55 21; 47 80 14; 33

44 26; 57 60 19; 43 85 13; 31

46 27; 57 65 19; 42 90 13; 31

X = 19

38 33; 67 48 26; 55 70 17; 39 100 11.86; 28.06

38 31; 64 40 29; 62 48 24; 53 50 23; 51 70 16; 38 75 15; 35 100 11.06; 26.92

34 25; 59 45 18; 47 65 12; 33 100 7.87; 22.37 36 26; 59 46 20; 48 65 14; 35 90 10; 26

40 32; 64 50 25; 53 75 16;37

42 30; 61 55 22; 49 80 15; 35

44 28; 59 60 20; 45 85 14; 33

46 29; 59 65 20; 43 90 14; 32

36 23; 57 47 17; 45 70 11; 31

38 24; 57 48 19; 46 70 13; 33 100 8.645; 23.53

X = 20

40 34; 66 50 26; 55 75 17; 38

42 32; 64 55 24; 50 80 16; 36

44 30; 61 60 22, 47 85 15; 34

X = 21

42 34; 66 55 25; 52 80 17;37

44 32; 63 60 23; 48 85 16; 35

46 31; 61 65 21; 45 90 15; 33

48 29; 59 70 20; 42 100 13.51; 30.28

50 28; 57 75 18; 40

44 35; 65 60 25; 50 85 17; 36

46 33; 63 65 23; 47 90 16; 35

48 31; 61 70 21; 44 95 15; 33

50 30; 59 75 19; 41 100 14.35; 31.37

55 27; 54 80 18; 39

46 35; 65 65 24; 48 90 17; 36

48 33; 63 70 22; 45 95 16; 34



X = 22

X = 23

X = 24

48 35; 65 70 23; 47 95 17; 35

50 32; 61 75 20; 42 100 15.19; 32.47

50 34; 63 55 30; 58 75 22; 44 80 20; 41 100 16.03; 33.56

48 28; 57 70 18; 41 100 12.66; 29.19

55 29; 56 80 19; 40

60 26; 52 85 18; 38

60 28; 53 85 19; 39

65 25; 50 90 18; 37


Appendix 2 (continued)

X = 25
50 36; 64 75 23; 45 100 16.88; 34.66
55 32; 59 80 21; 43
60 29; 55 85 20; 40
65 27; 51 90 19; 38
70 25; 48 95 18; 36

X = 26
52 36; 64 75 24; 47 100 17.75; 35.72
55 34; 61 80 22; 44
60 31; 57 85 21; 42
65 28; 53 90 20; 39
70 26; 50 95 19; 37

X = 27
54 36; 64 75 25; 48 100 18.62; 36.79
55 35; 63 80 24; 45
60 32; 58 85 22; 43
65 29; 54 90 21; 41
70 27; 51 95 20; 39

X = 28
56 36; 64 80 25; 46
60 34; 60 85 23; 44
65 31; 56 90 22; 42
70 28; 52 95 21; 40
75 26; 49 100 19.50; 37.85

X = 29
58 37; 63 80 26; 48
60 35; 62 85 24; 45
65 32; 57 90 23; 43
70 30; 54 95 21; 41
75 28; 51 100 20.37; 38.92

X = 30
60 37; 63 85 25; 46
65 34; 59 90 24; 44
70 31; 55 95 22; 42
75 29; 52 100 21.24; 39.98
80 27; 49

X = 31
62 37; 63 85 26; 48
65 35; 60 90 25; 45
70 32; 57 95 23; 43
75 30; 53 100 22.14; 41.02
80 28; 50

X = 32
64 37; 63 85 27; 49
65 37; 62 90 26; 46
70 34; 58 95 24; 44
75 31; 55 100 23.04; 42.06
80 29; 52

X = 33
66 37; 63 90 27; 47
70 35; 59 95 25; 45
75 33; 56 100 23.93; 43.10
80 30; 53
85 28; 50

X = 34
68 38; 62 90 28; 49
70 36; 61 95 26; 46
75 34; 57 100 24.83; 44.15
80 32; 54
85 30; 51

X = 35
70 38; 62 95 27; 47
75 35; 59 80 33; 55 100 25.73; 45.19
85 31; 52
90 29; 50

X = 36
72 38; 62 95 28; 48
75 36; 60 80 34; 57 100 26.65; 46.20
85 32; 54
90 30; 51

X = 37
74 38; 62 95 29; 49
75 38; 61 80 35; 58 100 27.57; 47.22
85 33; 55
90 31; 52

X = 38
76 38; 62 100 28.49; 48.24
80 36; 59
85 34; 56
90 32; 53
95 30; 51

X = 39
78 38; 62 100 29.41; 49.26
80 37; 60
85 35; 57
90 33; 54
95 31; 52

X = 40
80 39; 61
85 36; 58
90 34; 55
95 32; 53
100 30.33; 50.28


X = 41

82 39; 61

85 37; 59

90 35; 56

95 33; 54

100 31.27; 51.28

X = 42

84 39; 61

85 38; 60

90 36; 57

95 34; 55

100 32.21; 52.28

X = 43

86 39; 61

90 37; 59

95 35; 56

100 33.15; 53.27

X = 44

88 39; 61

90 38; 60

95 36; 57

100 34.09; 54.27

X = 45

90 39; 61

95 37; 58

100 35.03; 55.27

X = 46

92 39; 61

95 38; 59

100 35.99; 56.25

X = 47

94 40; 60

95 39; 60

100 36.95; 57.23

X = 48

96 40; 60

100 37.91; 58.21

X = 49

98 40; 60

100 38.87; 59.19

X = 50 100 39.83; 60.17


Appendix 3 Critical Values of the t Distribution*

ν    α(2): 0.50  0.20  0.10  0.05  0.02  0.01  0.005  0.002  0.001
     α(1): 0.25  0.10  0.05  0.025  0.01  0.005  0.0025  0.001  0.0005

1    1.000  3.078  6.314  12.706  31.821  63.657  127.321  318.309  636.619
2    0.816  1.886  2.920  4.303  6.965  9.925  14.089  22.327  31.599
3    0.765  1.638  2.353  3.182  4.541  5.841  7.453  10.215  12.924
4    0.741  1.533  2.132  2.776  3.747  4.604  5.598  7.173  8.610
5    0.727  1.476  2.015  2.571  3.365  4.032  4.773  5.893  6.869

6    0.718  1.440  1.943  2.447  3.143  3.707  4.317  5.208  5.959
7    0.711  1.415  1.895  2.365  2.998  3.499  4.029  4.785  5.408
8    0.706  1.397  1.860  2.306  2.896  3.355  3.833  4.501  5.041
9    0.703  1.383  1.833  2.262  2.821  3.250  3.690  4.297  4.781
10   0.700  1.372  1.812  2.228  2.764  3.169  3.581  4.144  4.587

11   0.697  1.363  1.796  2.201  2.718  3.106  3.497  4.025  4.437
12   0.695  1.356  1.782  2.179  2.681  3.055  3.428  3.930  4.318
13   0.694  1.350  1.771  2.160  2.650  3.012  3.372  3.852  4.221
14   0.692  1.345  1.761  2.145  2.624  2.977  3.326  3.787  4.140
15   0.691  1.341  1.753  2.131  2.602  2.947  3.286  3.733  4.073

16   0.690  1.337  1.746  2.120  2.583  2.921  3.252  3.686  4.015
17   0.689  1.333  1.740  2.110  2.567  2.898  3.222  3.646  3.965
18   0.688  1.330  1.734  2.101  2.552  2.878  3.197  3.610  3.922
19   0.688  1.328  1.729  2.093  2.539  2.861  3.174  3.579  3.883
20   0.687  1.325  1.725  2.086  2.528  2.845  3.153  3.552  3.850

21   0.686  1.323  1.721  2.080  2.518  2.831  3.135  3.527  3.819
22   0.686  1.321  1.717  2.074  2.508  2.819  3.119  3.505  3.792
23   0.685  1.319  1.714  2.069  2.500  2.807  3.104  3.485  3.768
24   0.685  1.318  1.711  2.064  2.492  2.797  3.091  3.467  3.745
25   0.684  1.316  1.708  2.060  2.485  2.787  3.078  3.450  3.725

26   0.684  1.315  1.706  2.056  2.479  2.779  3.067  3.435  3.707
27   0.684  1.314  1.703  2.052  2.473  2.771  3.057  3.421  3.690
28   0.683  1.313  1.701  2.048  2.467  2.763  3.047  3.408  3.674
29   0.683  1.311  1.699  2.045  2.462  2.756  3.038  3.396  3.659
30   0.683  1.310  1.697  2.042  2.457  2.750  3.030  3.385  3.646

31   0.682  1.309  1.696  2.040  2.453  2.744  3.022  3.375  3.633
32   0.682  1.309  1.694  2.037  2.449  2.738  3.015  3.365  3.622
33   0.682  1.308  1.692  2.035  2.445  2.733  3.008  3.356  3.611
34   0.682  1.307  1.691  2.032  2.441  2.728  3.002  3.348  3.601
35   0.682  1.306  1.690  2.030  2.438  2.724  2.996  3.340  3.591

36   0.681  1.306  1.688  2.028  2.434  2.719  2.990  3.333  3.582
37   0.681  1.305  1.687  2.026  2.431  2.715  2.985  3.326  3.574
38   0.681  1.304  1.686  2.024  2.429  2.712  2.980  3.319  3.566
39   0.681  1.304  1.685  2.023  2.426  2.708  2.976  3.313  3.558
40   0.681  1.303  1.684  2.021  2.423  2.704  2.971  3.307  3.551

41   0.681  1.303  1.683  2.020  2.421  2.701  2.967  3.301  3.544
42   0.680  1.302  1.682  2.018  2.418  2.698  2.963  3.296  3.538
43   0.680  1.302  1.681  2.017  2.416  2.695  2.959  3.291  3.532
44   0.680  1.301  1.680  2.015  2.414  2.692  2.956  3.286  3.526
45   0.680  1.301  1.679  2.014  2.412  2.690  2.952  3.281  3.520

46   0.680  1.300  1.679  2.013  2.410  2.687  2.949  3.277  3.515
47   0.680  1.300  1.678  2.012  2.408  2.685  2.946  3.273  3.510
48   0.680  1.299  1.677  2.011  2.407  2.682  2.943  3.269  3.505
49   0.680  1.299  1.677  2.010  2.405  2.680  2.940  3.265  3.500
50   0.679  1.299  1.676  2.009  2.403  2.678  2.937  3.261  3.496

ν = degrees of freedom. α(2) = α for a two-tailed test. α(1) = α for a one-tailed test. *Used with permission from Zar JH. Biostatistical Analysis, ed 3. Englewood Cliffs, NJ: Prentice-Hall; 1996:App18.


Appendix 3 (continued) Critical Values of the t Distribution*

ν    α(2): 0.50  0.20  0.10  0.05  0.02  0.01  0.005  0.002  0.001
     α(1): 0.25  0.10  0.05  0.025  0.01  0.005  0.0025  0.001  0.0005

52   0.679  1.298  1.675  2.007  2.400  2.674  2.932  3.255  3.488
54   0.679  1.297  1.674  2.005  2.397  2.670  2.927  3.248  3.480
56   0.679  1.297  1.673  2.003  2.395  2.667  2.923  3.242  3.473
58   0.679  1.296  1.672  2.002  2.392  2.663  2.918  3.237  3.466
60   0.679  1.296  1.671  2.000  2.390  2.660  2.915  3.232  3.460

62   0.678  1.295  1.670  1.999  2.388  2.657  2.911  3.227  3.454
64   0.678  1.295  1.669  1.998  2.386  2.655  2.908  3.223  3.449
66   0.678  1.295  1.668  1.997  2.384  2.652  2.904  3.218  3.444
68   0.678  1.294  1.668  1.995  2.382  2.650  2.902  3.214  3.439
70   0.678  1.294  1.667  1.994  2.381  2.648  2.899  3.211  3.435

72   0.678  1.293  1.666  1.993  2.379  2.646  2.896  3.207  3.431
74   0.678  1.293  1.666  1.993  2.378  2.644  2.894  3.204  3.427
76   0.678  1.293  1.665  1.992  2.376  2.642  2.891  3.201  3.423
78   0.678  1.292  1.665  1.991  2.375  2.640  2.889  3.198  3.420
80   0.678  1.292  1.664  1.990  2.374  2.639  2.887  3.195  3.416

82   0.677  1.292  1.664  1.989  2.373  2.637  2.885  3.193  3.413
84   0.677  1.292  1.663  1.989  2.372  2.636  2.883  3.190  3.410
86   0.677  1.291  1.663  1.988  2.370  2.634  2.881  3.188  3.407
88   0.677  1.291  1.662  1.987  2.369  2.633  2.880  3.185  3.405
90   0.677  1.291  1.662  1.987  2.368  2.632  2.878  3.183  3.402

92   0.677  1.291  1.662  1.986  2.368  2.630  2.876  3.181  3.399
94   0.677  1.291  1.661  1.986  2.367  2.629  2.875  3.179  3.397
96   0.677  1.290  1.661  1.985  2.366  2.628  2.873  3.177  3.395
98   0.677  1.290  1.661  1.984  2.365  2.627  2.872  3.175  3.393
100  0.677  1.290  1.660  1.984  2.364  2.626  2.871  3.174  3.390

105  0.677  1.290  1.659  1.983  2.362  2.623  2.868  3.170  3.386
110  0.677  1.289  1.659  1.982  2.361  2.621  2.865  3.166  3.381
115  0.677  1.289  1.658  1.981  2.359  2.619  2.862  3.163  3.377
120  0.677  1.289  1.658  1.980  2.358  2.617  2.860  3.160  3.373
125  0.676  1.288  1.657  1.979  2.357  2.616  2.858  3.157  3.370

130  0.676  1.288  1.657  1.978  2.355  2.614  2.856  3.154  3.367
135  0.676  1.288  1.656  1.978  2.354  2.613  2.854  3.152  3.364
140  0.676  1.288  1.656  1.977  2.353  2.611  2.852  3.149  3.361
145  0.676  1.287  1.655  1.976  2.352  2.610  2.851  3.147  3.359
150  0.676  1.287  1.655  1.976  2.351  2.609  2.849  3.145  3.357

160  0.676  1.287  1.654  1.975  2.350  2.607  2.846  3.142  3.352
170  0.676  1.287  1.654  1.974  2.348  2.605  2.844  3.139  3.349
180  0.676  1.286  1.653  1.973  2.347  2.603  2.842  3.136  3.345
190  0.676  1.286  1.653  1.973  2.346  2.602  2.840  3.134  3.342
200  0.676  1.286  1.653  1.972  2.345  2.601  2.839  3.131  3.340

250  0.675  1.285  1.651  1.969  2.341  2.596  2.832  3.123  3.330
300  0.675  1.284  1.650  1.968  2.339  2.592  2.828  3.118  3.323
350  0.675  1.284  1.649  1.967  2.337  2.590  2.825  3.114  3.319
400  0.675  1.284  1.649  1.966  2.336  2.588  2.823  3.111  3.315
450  0.675  1.283  1.648  1.965  2.335  2.587  2.821  3.108  3.312

500  0.675  1.283  1.648  1.965  2.334  2.586  2.820  3.107  3.310
600  0.675  1.283  1.647  1.964  2.333  2.584  2.817  3.104  3.307
700  0.675  1.283  1.647  1.963  2.332  2.583  2.816  3.102  3.304
800  0.675  1.283  1.647  1.963  2.331  2.582  2.815  3.100  3.303
900  0.675  1.282  1.647  1.963  2.330  2.581  2.814  3.099  3.301

1000 0.675  1.282  1.646  1.962  2.330  2.581  2.813  3.098  3.300
∞    0.6745 1.2816 1.6449 1.9600 2.3263 2.5758 2.8070 3.0902 3.2905
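These tabulated values can be regenerated from the t distribution itself. A minimal sketch, assuming SciPy is available (t_crit is our hypothetical helper):

```python
from scipy.stats import t

def t_crit(alpha2, nu):
    """Critical value for a two-tailed test at level alpha2 with nu df."""
    return t.ppf(1 - alpha2 / 2, nu)

print(round(t_crit(0.05, 10), 3))    # 2.228: the nu = 10, alpha(2) = 0.05 entry
print(round(t_crit(0.05, 1000), 3))  # 1.962: converging on the normal value 1.960
```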


Appendix 4 Critical Values of the χ² Distribution*

ν   α: 0.999  0.995  0.99  0.975  0.95  0.90  0.75  0.50  0.25  0.10  0.05  0.025  0.01  0.005  0.001

1   0.000 0.000 0.000 0.001 0.004 0.016 0.102 0.455 1.323 2.706 3.841 5.024 6.635 7.879 10.828
2   0.002 0.010 0.020 0.051 0.103 0.211 0.575 1.386 2.773 4.605 5.991 7.378 9.210 10.597 13.816
3   0.024 0.072 0.115 0.216 0.352 0.584 1.213 2.366 4.108 6.251 7.815 9.348 11.345 12.838 16.266
4   0.091 0.207 0.297 0.484 0.711 1.064 1.923 3.357 5.385 7.779 9.488 11.143 13.277 14.860 18.467
5   0.210 0.412 0.554 0.831 1.145 1.610 2.675 4.351 6.626 9.236 11.070 12.833 15.086 16.750 20.515

6   0.381 0.676 0.872 1.237 1.635 2.204 3.455 5.348 7.841 10.645 12.592 14.449 16.812 18.548 22.458
7   0.599 0.989 1.239 1.690 2.167 2.833 4.255 6.346 9.037 12.017 14.067 16.013 18.475 20.278 24.322
8   0.857 1.344 1.646 2.180 2.733 3.490 5.071 7.344 10.219 13.362 15.507 17.535 20.090 21.955 26.124
9   1.152 1.735 2.088 2.700 3.325 4.168 5.899 8.343 11.389 14.684 16.919 19.023 21.666 23.589 27.877
10  1.479 2.156 2.558 3.247 3.940 4.865 6.737 9.342 12.549 15.987 18.307 20.483 23.209 25.188 29.588

11  1.834 2.603 3.053 3.816 4.575 5.578 7.584 10.341 13.701 17.275 19.675 21.920 24.725 26.757 31.264
12  2.214 3.074 3.571 4.404 5.226 6.304 8.438 11.340 14.845 18.549 21.026 23.337 26.217 28.300 32.909
13  2.617 3.565 4.107 5.009 5.892 7.042 9.299 12.340 15.984 19.812 22.362 24.736 27.688 29.819 34.528
14  3.041 4.075 4.660 5.629 6.571 7.790 10.165 13.339 17.117 21.064 23.685 26.119 29.141 31.319 36.123
15  3.483 4.601 5.229 6.262 7.261 8.547 11.037 14.339 18.245 22.307 24.996 27.448 30.578 32.801 37.697

16  3.942 5.142 5.812 6.908 7.962 9.312 11.912 15.338 19.369 23.542 26.296 28.845 32.000 34.267 39.252
17  4.416 5.697 6.408 7.564 8.672 10.085 12.792 16.338 20.489 24.769 27.587 30.191 33.409 35.718 40.790
18  4.905 6.265 7.015 8.231 9.390 10.865 13.675 17.338 21.605 25.989 28.869 31.526 34.805 37.156 42.312
19  5.407 6.844 7.633 8.907 10.117 11.651 14.562 18.338 22.718 27.204 30.144 32.852 36.191 38.582 43.820
20  5.921 7.434 8.260 9.591 10.851 12.443 15.452 19.337 23.828 28.412 31.410 34.170 37.566 39.997 45.315

21  6.447 8.034 8.897 10.283 11.591 13.240 16.344 20.337 24.935 29.615 32.671 35.479 38.932 41.401 46.797
22  6.983 8.643 9.542 10.982 12.338 14.041 17.240 21.337 26.039 30.813 33.924 36.781 40.289 42.796 48.268
23  7.529 9.260 10.196 11.689 13.091 14.848 18.137 22.337 27.141 32.007 35.172 38.076 41.638 44.181 49.728
24  8.085 9.886 10.856 12.401 13.848 15.659 19.037 23.337 28.241 33.196 36.415 39.364 42.980 45.559 51.179
25  8.649 10.520 11.524 13.120 14.611 16.473 19.939 24.337 29.339 34.382 37.652 40.646 44.314 46.928 52.620

26  9.222 11.160 12.198 13.844 15.379 17.292 20.843 25.336 30.435 35.563 38.885 41.923 45.642 48.290 54.052
27  9.803 11.808 12.879 14.573 16.151 18.114 21.749 26.336 31.528 36.741 40.113 43.195 46.963 49.645 55.476
28  10.391 12.461 13.565 15.308 16.928 18.939 22.657 27.336 32.620 37.916 41.337 44.461 48.278 50.993 56.892
29  10.986 13.121 14.256 16.047 17.708 19.768 23.567 28.336 33.711 39.087 42.557 45.722 49.588 52.336 58.301
30  11.588 13.787 14.953 16.791 18.493 20.599 24.478 29.336 34.800 40.256 43.773 46.979 50.892 53.672 59.703

31  12.196 14.458 15.655 17.539 19.281 21.434 25.390 30.336 35.887 41.422 44.985 48.232 52.191 55.003 61.098
32  12.811 15.134 16.362 18.291 20.072 22.271 26.304 31.336 36.973 42.585 46.194 49.480 53.486 56.328 62.487
33  13.431 15.815 17.074 19.047 20.867 23.110 27.219 32.336 38.058 43.745 47.400 50.725 54.776 57.648 63.870
34  14.057 16.501 17.789 19.806 21.664 23.952 28.136 33.336 39.141 44.903 48.602 51.966 56.061 58.964 65.247
35  14.688 17.192 18.509 20.569 22.465 24.797 29.054 34.336 40.223 46.059 49.802 53.203 57.342 60.275 66.619

36  15.324 17.887 19.233 21.336 23.269 25.643 29.973 35.336 41.304 47.212 50.998 54.437 58.619 61.581 67.985
37  15.965 18.586 19.960 22.106 24.075 26.492 30.893 36.336 42.383 48.363 52.192 55.668 59.893 62.883 69.346
38  16.611 19.289 20.691 22.878 24.884 27.343 31.815 37.335 43.462 49.513 53.384 56.896 61.162 64.181 70.703
39  17.262 19.996 21.426 23.654 25.695 28.196 32.737 38.335 44.539 50.660 54.572 58.120 62.428 65.476 72.055
40  17.916 20.707 22.164 24.433 26.509 29.051 33.660 39.335 45.616 51.805 55.758 59.342 63.691 66.766 73.402

41  18.576 21.421 22.906 25.215 27.326 29.907 34.585 40.335 46.692 52.949 56.942 60.561 64.950 68.053 74.745
42  19.239 22.138 23.650 25.999 28.144 30.765 35.510 41.335 47.766 54.090 58.124 61.777 66.206 69.336 76.084
43  19.906 22.859 24.398 26.785 28.965 31.625 36.436 42.335 48.840 55.230 59.304 62.990 67.459 70.616 77.419
44  20.576 23.584 25.148 27.575 29.787 32.487 37.363 43.335 49.913 56.369 60.481 64.201 68.710 71.893 78.750
45  21.251 24.311 25.901 28.366 30.612 33.350 38.291 44.335 50.985 57.505 61.656 65.410 69.957 73.166 80.077

46  21.929 25.041 26.657 29.160 31.439 34.215 39.220 45.335 52.056 58.641 62.830 66.617 71.201 74.437 81.400
47  22.610 25.775 27.416 29.956 32.268 35.081 40.149 46.335 53.127 59.774 64.001 67.821 72.443 75.704 82.720
48  23.295 26.511 28.177 30.755 33.098 35.949 41.079 47.335 54.196 60.907 65.171 69.023 73.683 76.969 84.037
49  23.983 27.249 28.941 31.555 33.930 36.818 42.010 48.335 55.265 62.038 66.339 70.222 74.919 78.231 85.351
50  24.674 27.991 29.707 32.357 34.764 37.689 42.942 49.335 56.334 63.167 67.505 71.420 76.154 79.490 86.661

ν = degrees of freedom.
*Used with permission from Zar JH. Biostatistical Analysis, ed 3. Englewood Cliffs, NJ: Prentice Hall; 1996:App13.
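As with the t table, these values can be regenerated. A minimal sketch, assuming SciPy is available (chi2_crit is our hypothetical helper; α here is the upper-tail probability):

```python
from scipy.stats import chi2

def chi2_crit(alpha, nu):
    """Value x such that P(chi-square with nu df > x) = alpha."""
    return chi2.ppf(1 - alpha, nu)

print(round(chi2_crit(0.05, 2), 3))   # 5.991: the nu = 2, alpha = 0.05 entry
print(round(chi2_crit(0.05, 10), 3))  # 18.307: the nu = 10 entry
```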


Appendix 5 Sources of invalidity*

Pre-experimental designs:
1. One-shot case study
   X O
2. One-group pretest-posttest design
   O X O
3. Static-group comparison
   X O
     O

True experimental designs:
4. Pretest-posttest control group design
   R O X O
   R O   O
5. Solomon four-group design
   R O X O
   R O   O
   R   X O
   R     O
6. Posttest-only control group design
   R X O
   R   O

Quasi-experimental designs:
7. Time series
   O O O O X O O O O
8. Equivalent time samples design
   X1O X2O X1O X0O, etc.
9. Nonequivalent control design
   O X O
   O   O

Note: In the tables, (−) indicates a definite weakness, (+) indicates that the factor is controlled, (?) indicates a possible source of concern, and a blank indicates that the factor is not relevant.

Sources of invalidity: H = History; MAT = Maturation; T = Testing; INST = Instrumentation; R = Regression; S = Selection; MORT = Mortality; INTER = Interaction of Selection and Maturation, etc.

Design notation: R = Randomization of subjects into separate groups; X = Treatment; O = Observation or measurement.

*Used with permission from Campbell DT, Stanley JC. Experimental and Quasi-Experimental Designs for Research. Chicago: Rand-McNally College Publishing; 1963.


Appendix 5 (continued) Sources of invalidity*

10. Counterbalanced designs
    X1O X2O X3O X4O
    X2O X4O X1O X3O
    X3O X1O X4O X2O
    X4O X3O X2O X1O
11. Separate-sample pretest-posttest design
    R O (X)
    R X O
12. Separate-sample pretest-posttest control group design
    R O (X)
    R X O
    R O
    R O
13. Multiple time-series
    O O O X O O O
    O O O X O O O



Appendix 6 Approximate total number of subjects required for significance for different values of d. Note that the values given are for the total number of subjects in the two groups, so that if an investigator suspected an effect size of d = 0.4 and decided on a significance level of 5%, 100 subjects would be required, with 50 in each group.*

       Total no. of subjects
d      5% level    1% level
0.4    100         200
0.5    77          132
0.6    56          97
0.7    38          72
0.8    29          52
0.9    24          38
1.0    20          30
1.2    15          24
1.4    13          20
1.6    11          17
1.8    10          15
2.0    8           12
2.2    8           11
2.4    7           10
2.6    7           9
2.8    7           8
3.0    7           8

*Used with permission from Plutchik R. Foundations of Experimental Research, ed 2. New York: Harper and Row; 1974.
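Plutchik’s table does not state the statistical power it assumes, so it cannot be reproduced exactly; for comparison, the standard normal-approximation formula can be sketched at the conventional 80% power, which demands larger samples than the table. A minimal sketch, assuming SciPy is available (total_n is our hypothetical helper):

```python
from math import ceil
from scipy.stats import norm

def total_n(d, alpha=0.05, power=0.80):
    """Total N (two equal groups, two-tailed test) to detect standardized
    effect size d, by the usual normal approximation."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    per_group = 2 * (z_a + z_b) ** 2 / d ** 2
    return 2 * ceil(per_group)

print(total_n(0.4))   # ~198 at 80% power, vs 100 in the table above
```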



Appendix 7 The degree of overlap of experimental and control group distributions for different values of d, with the corresponding percent of variance accounted for (σ²).*

d      Degree of overlap (%)    σ²
0.0    100.00                   .00
0.1    92.3                     .00
0.2    85.3                     .01
0.3    78.7                     .02
0.4    72.6                     .04
0.5    67.0                     .06
0.6    61.8                     .08
0.7    57.0                     .11
0.8    52.6                     .14
0.9    48.4                     .17
1.0    44.6                     .20
1.1    41.4                     .23
1.2    37.8                     .26
1.3    34.7                     .30
1.4    31.9                     .33
1.5    29.3                     .36
1.6    26.9                     .39
1.7    24.6                     .42
1.8    22.6                     .45
1.9    20.6                     .47
2.0    18.9                     .50
2.2    15.7                     .55
2.4    13.0                     .59
2.6    10.7                     .63
2.8    8.8                      .66
3.0    7.2                      .69
3.2    5.8                      .72
3.4    4.7                      .74
3.6    3.7                      .76
3.8    3.0                      .78
4.0    2.3                      .80

*Used with permission from Plutchik R. Foundations of Experimental Research, ed. 2. New York: Harper and Row; 1974.
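Both columns follow from the normal model: the overlap values match one minus Cohen’s U1 nonoverlap measure, and σ² matches the point-biserial conversion r² = d²/(d² + 4) for equal group sizes. A minimal sketch under those assumptions, assuming SciPy is available (overlap_and_r2 is our hypothetical helper):

```python
from scipy.stats import norm

def overlap_and_r2(d):
    """Percent overlap (1 - Cohen's U1) and variance accounted for, given d."""
    phi = norm.cdf(d / 2)
    overlap = 1 - (2 * phi - 1) / phi    # Cohen's U1 = (2*Phi(d/2) - 1)/Phi(d/2)
    r2 = d**2 / (d**2 + 4)               # point-biserial conversion, equal n
    return 100 * overlap, r2

print(overlap_and_r2(1.0))   # ~ (44.6, 0.20), matching the d = 1.0 row
```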



Appendix 8 Glossary of terms used in statistics and epidemiology

Absolute risk: The probability of an event in the population being studied; it is a measure of the population impact. Although heavy smokers have a 20 times greater risk of lung cancer relative to nonsmokers and only a 2 times greater risk for heart disease, the effect on the burden of illness of the population is greater for heart disease, because heart disease is much more common than lung cancer.

Absolute risk reduction (ARR): The difference between two absolute risks. If the incidence of stroke in an untreated (or placebo-treated) high-risk population is 5% for a given time period and the incidence for those taking an experimental drug is 1% for the same time period, the ARR is 5% − 1%, or 4%; that is, the drug has lowered the risk of stroke by 4%. The inverse of the ARR (1/ARR) gives the number needed to treat to prevent one adverse event. In this example, clinicians would need to treat 1/0.04, or 25, people to prevent one stroke.

Association: A relationship between an exposure (or a characteristic) and a disease that is statistically dependent; that is, the presence of one alters the probability of observing the presence of the other. Association is a necessary condition of a causal relationship, but many exposure-disease associations are not causal. If there is no association, the variables are said to be independent.

Association (test of): A test used to assess the strength of a relationship between or among nominal variables (as opposed to correlation, a term typically reserved for measures that assess the strength of a relationship between two ordinal or continuous variables). If two or more variables are associated (or correlated), they tend to occur together. A test of association (such as the chi-square test) indicates whether variables are associated. Yule’s coefficient of association:

(ad − bc) / (ad + bc)

Bias: A systematic (as opposed to random) error leading to the deviation of results from the truth in one direction. Bias can result from many sources, including sample selection, measurement methods, and interpretation.

Causation (causality): The relating of causes to the effects they produce. Observational epidemiologic evidence by itself is not sufficient to establish causation, but it can provide powerful circumstantial evidence.

Collinearity: In regression analysis, the condition of two or more explanatory variables being correlated or not independent of one another.

Confounding: The error, sometimes described as a bias, that can occur when study groups that have been formed based on exposure to a risk factor (eg, periodontal disease) are being compared to determine their association with an outcome (eg, erectile dysfunction) but the groups differ in their exposure to risk or prognostic factors (eg, smoking) other than the factor being investigated. Confounding can cause overestimation or underestimation of the true association between an exposure and an outcome.

Confounding variable: A confounding variable is one that (1) has an association with disease; (2) is also associated with the exposure variable that is being investigated and is distributed differently among the different exposure levels; and (3) does not lie on the causal pathway between the exposure and the disease. If a confounding variable is known, appropriate research design can control for it.

Correlation: Linear association between two continuous or ordinal variables indicating that a change in one variable is often accompanied by a change in the other. The measure of the correlation is the correlation coefficient, which ranges from 1 (perfect positive association) through 0 (no association) to −1 (perfect negative association). Association, a more general term in common use, is typically used in statistics to describe a relationship between or among categorical variables.

Correlation coefficient (r): A value between −1 and 1 that indicates the degree to which two variables are related. A value of 0 indicates no relationship (ie, the variables are independent), a value of 1 a perfect positive relationship, and a value of −1 a perfect negative relationship. The calculation of r depends on the kinds of variables being related; for example, the Pearson product-moment coefficient is used for interval and ratio scale data, whereas Spearman’s rho is used for ordinal data. Correlation coefficients differ according to the type of measurement of the variables being correlated: Kendall’s rank-correlation coefficient measures the linear relationship between two ordinal variables; Pearson’s product-moment correlation coefficient measures the linear relationship between two approximately normally distributed continuous variables; and Spearman’s rank-order correlation coefficient measures the linear relationship between two variables, one or both of which are markedly non-normally distributed continuous variables.
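A minimal sketch contrasting the two most common coefficients, assuming NumPy and SciPy are available; the data are illustrative only:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

print(pearsonr(x, y))    # product-moment r: linear association, interval/ratio data
print(spearmanr(x, y))   # Spearman rho: rank-based association, ordinal data
```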


Correlational studies: Studies concerned with investigating the associations between variables.

Correlation matrix: A tabular presentation of correlation coefficients between pairs of variables. Each variable is both a row and a column heading; the value at the table cell where a column and row intersect is the correlation between those variables. The cell entries for rows and columns for the same variable form the diagonal of the table and are always 1.00 (because each variable correlates perfectly with itself). Because the correlation between variable 1 and variable 2 is the same as the correlation between variable 2 and variable 1, the correlation matrix is symmetrical above and below the diagonal, and the table is often simplified by omitting the bottom triangle.

Cox regression (proportional hazards model): A regression method for modeling survival times in which the independent variables are described as predictor variables or prognostic factors. Cox regression makes no assumption about the distribution of survival times.

Critical region: In hypothesis testing, the critical region is the area in a sampling distribution where the values of the test statistic lead to the rejection of the null hypothesis.

Critical value of a statistic: The value of the statistic (obtained from appropriate tables) that the calculated value for a given result must exceed in order to attain statistical significance.

Degrees of freedom (df or dof): Defined as n (the sample size) minus the number of restrictions, where a restriction occurs for each parameter that has to be estimated from the observed data. For example, to calculate the variance, one has to square the difference between each observed value and the mean; but the mean has to be estimated from the observed values and is thus a restriction. Therefore, if there are 10 data points, the df of the variance is 10 − 1 = 9. The shapes of the probability distributions of test statistics (such as χ²) depend on the degrees of freedom, and hence on sample size.

Dose-response relationship (DRR): Change in the response (such as disease) as exposure to the factor of interest increases. The stronger the DRR, the more likely it is that the factor causes the disease.

Error term: A term in a regression equation that captures the degree to which the line is in error (for example, the residual) in describing each point.
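A minimal sketch of looking up a critical value, assuming scipy is available in place of printed tables; the degrees of freedom, alpha level, and example test statistic are illustrative:

```python
# Minimal sketch: critical value and degrees of freedom for a chi-square test.
from scipy import stats

df = 1                               # eg, a 2 x 2 contingency table: (2 - 1) * (2 - 1)
critical = stats.chi2.ppf(0.95, df)  # value the statistic must exceed at alpha = .05
print(round(critical, 2))            # 3.84

# A calculated chi-square of, say, 5.2 exceeds 3.84, so it falls in the
# critical region and the null hypothesis is rejected at the 5% level.
```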

Extraneous (confounding, intervening) variable: A study characteristic (variable) that confounds the relationship between the independent and dependent variables; a phenomenon that has an effect on the study variables but is not necessarily the object of the study. Error or bias in interpreting the relationship between the explanatory and response variables can be created by a (confounding) variable that may cause, prevent, or otherwise affect the outcome of interest and that is also associated with the interventions or characteristics under study. Confounding variables, if known, can be controlled for through good research design or statistical analysis.

Fishing and the error rate problem: A problem that occurs as a result of conducting multiple analyses and treating each one as independent.

General linear model (GLM): A mathematical model that underlies many statistical analyses. It is the foundation for the t test, analysis of variance (ANOVA), analysis of covariance (ANCOVA), regression analysis, and many other multivariable methods, including factor analysis, cluster analysis, and multidimensional scaling.

Hazard ratio (HR): Commonly used in reporting data on survival curves or other instances when time-to-event is the outcome variable, the HR is the ratio of the risk of an event in one group to the risk in another group. If HR = 1, the groups are at equal risk. Unlike relative risk ratios, which are cumulative over an entire study using a defined endpoint, hazard ratios represent instantaneous risks.

Historical controls: A cohort of subjects, usually identified from an earlier study, used as a control group for comparison with an experimental group; in contrast to concurrent (parallel) controls, for whom data are collected at the same time as for the treatment group.

Incidence: Incidence = (number of new cases in a fixed time period) / (number of people at risk).

The rate with which new events or cases appear during a certain period of time. Contrasts with prevalence, which is the rate at which existing events or cases are present at a given point or period in time. Incidence expressed as a proportion is cumulative incidence. Incidence expressed as a rate is called incidence density.
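A minimal sketch of the incidence and prevalence calculations, with hypothetical counts:

```python
# Minimal sketch: cumulative incidence versus point prevalence, made-up numbers.

new_cases = 12        # new cases during a 1-year period
at_risk = 400         # disease-free people at risk at the start of the period
existing_cases = 60   # all cases present at a given point in time
population = 1000

cumulative_incidence = new_cases / at_risk       # 0.03 -> 3% per year
point_prevalence = existing_cases / population   # 0.06 -> 6% at that point

print(cumulative_incidence, point_prevalence)
```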


Independence: In probability theory, the description of a state in which the occurrence of one event does not change the probability of another event; for example, in drawing a playing card from a full deck, the value of the card (ace, jack, etc) is independent of its suit (hearts, spades, etc). Typically demonstrated in statistics by showing that measures of association between variables are small. Most common statistical tests assume that the samples are independent, and difficulties arise when they are not. In particular, statistical tests using non-independent observations are prone to type I error, claiming a difference when none exists. The question of the independence of sampling units can give rise to debates, such as occurred in periodontics on the question of whether sites or mouths should be the sampling unit. Randomization is the commonly used approach for achieving independence of observations.

Independent samples: Samples whose values are not affected by other samples; in contrast to paired or matched samples, in which the second value depends to some extent on the value of the first (such as testing the same subjects before and after an intervention), as well as on any experimental intervention.

Interaction; interactive effect: A process in which the combined effects of two or more variables are greater than the sum of their individual effects. For example, two explanatory variables interact when the effect of one variable on the response variable depends on the value of the other variable. This contrasts with the main effect, which is the influence of a single explanatory variable on the response variable.

Latency (latent period): The period between the onset of exposure and the appearance of clinically detectable disease. For chronic diseases, the latent period can range up to decades.

Least squares: The criterion for fitting a regression line, which is a statistically calculated line drawn through a group of points to minimize the distances (actually the sum of the squares of the distances) between each point and the line itself. The regression line used in linear regression is usually a least-squares line.

Logistic regression: A method to model the dependence of a binary response variable (which takes a value of 1 or 0, such as presence [1] or absence [0] of disease) on one or more explanatory variables, which can be either continuous or categorical. It employs the logit function: logit(p) = ln(p / (1 − p)), where p is the probability (having a value between 0 and 1) and ln is the natural logarithm (to the base e). Logistic regression is typically used to calculate the odds ratio adjusted for some explanatory or predictor variables.
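A minimal sketch of the odds and logit functions as defined above, in plain Python with illustrative values:

```python
# Minimal sketch: the logit function used in logistic regression, and its
# relation to odds.
import math

def odds(p):
    """Odds = p / (1 - p)."""
    return p / (1 - p)

def logit(p):
    """logit(p) = ln(p / (1 - p)), the natural log of the odds."""
    return math.log(odds(p))

print(odds(0.75))   # ~3.0 -> '3 to 1' odds
print(logit(0.5))   # 0.0  -> probability 0.5 corresponds to log-odds of 0
```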

Multiple regression: A technique for fitting a regression model for a random variable that is dependent on several other variables. Such a model for a random variable Y that depends on variables x1, x2, … is Y = α + β1x1 + β2x2 + … + ε, where β1, β2, … are constants and ε (error) represents the residual variation. The values of α, β1, β2, … are estimated on the basis of sample data, usually by the method of least squares.

Multivariate analysis of variance (MANOVA): MANOVA techniques are used in complex situations where there are two or more independent variables and one or more dependent variables. The goal of multivariate analysis is to examine the relationship between two variables while allowing for the influence of other variables. MANOVA techniques are not for the faint of heart and are often performed by an expert statistician.

Observational studies: Besides simple observation and description, the term observational studies is generally used to include those studies in which no treatment is applied to groups but, through the circumstances of life, individuals may be exposed to different risks that may be used to group them. Representative approaches include case-control, cohort, cross-sectional, ecologic, and method comparison studies, as well as a wide range of qualitative research approaches. The term is also used more broadly to refer to all non-experimental studies.

Odds: The ratio of the probability that an event will occur divided by the probability that it will not: Odds = p / (1 − p), where p is the probability of the event.

Odds ratio (OR): Measure of association obtained from a case-control study (see Table 2-1). OR > 1, increased risk; OR = 1, no difference in risk; OR < 1, decreased risk (protective effect). The OR scale is skewed: protective effects are confined between 0 and 1, while increased risk runs from 1 to infinity.

Overfitting: A term used to describe a statistical model with too many explanatory variables in relation to the amount of data collected. Such models are said to "overfit" the data. A rule of thumb is that 10 instances of the event need to be recorded for every variable included in the model.

P value: The probability that an outcome (such as a difference between the means of two groups) at least as large as the one observed would occur by chance alone. P values range from 1 to 0, and by convention P < .05 has been chosen as the value that establishes statistical significance.

Pearson product-moment correlation: A particular type of correlation used when both variables can be assumed to be measured at an interval level of measurement. A statistic of association using interval or ratio level data and yielding a score between −1 and +1. It is a measure of linear association and so is unsuitable if the relationship is nonlinear. Both variables must be random.
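A minimal sketch of estimating α, β1, and β2 by least squares, assuming numpy is available; the data are simulated with known coefficients so the recovered estimates can be checked:

```python
# Minimal sketch: fitting Y = alpha + beta1*x1 + beta2*x2 + error by
# least squares on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x1, x2])     # design matrix with intercept column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # [alpha, beta1, beta2]
print(np.round(coef, 2))                      # approximately [1.0, 2.0, -0.5]
```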


Prevalence: Frequency of occurrence of a factor or condition in the population.

Random variable: A variable whose possible values are outcomes of a random phenomenon, such as the flip of a coin or a measurement that contains random error. A random variable has a probability distribution, which specifies the probability that its value falls in any given interval.

Regression: A statistical method for investigating relationships between variables by using "best fit" procedures, such as least squares, to calculate a functional relationship. Typically, a linear relationship is assumed and fitted by least squares to give, in the simplest case, an equation y = mx + b, where y is the dependent or predicted value in each pair of observations, x is the independent or predictor value in each pair, b is the intercept, and m is the slope of the regression line.

Relative risk (RR): Used in the reporting of cohort studies and randomized controlled trials, the RR is the ratio of the incidence of disease or death among the exposed to the incidence among the unexposed: RR = P1 / P2, where P1 is the probability of an event in group 1 (exposed) for a given period of time and P2 is the probability in group 2 (unexposed). Although the odds ratio uses a different calculation method, it approaches the RR when the probabilities are small. For example, among nonsmokers, people who drink more than 1.6 oz of alcohol per day have 2.33 times the risk of oral cancer compared with persons who do not drink alcohol. To put RR values in perspective, RR should be complemented by absolute risk and population impact information.

Relative risk reduction (RRR): The reduction in risk in the treated group expressed as a percentage of the risk in the comparison group; that is, the absolute risk difference divided by the risk in the comparison group (also called the relative risk difference). For example, if the incidence of nausea in men with esophageal reflux disease who take omeprazole is 1.2% and the incidence in men given another drug is 2.2%, then the RRR for men taking omeprazole is 45% [(2.2% − 1.2%) / 2.2% ≈ 45%].
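A minimal sketch of the RR and RRR calculations, using the omeprazole figures from the entry above:

```python
# Minimal sketch: relative risk and relative risk reduction, with the
# figures from the omeprazole example.

risk_treated = 0.012   # incidence of nausea with omeprazole (1.2%)
risk_control = 0.022   # incidence with the comparison drug (2.2%)

rr = risk_treated / risk_control                    # relative risk, ~0.55
rrr = (risk_control - risk_treated) / risk_control  # relative risk reduction, ~0.45

print(round(rr, 2), f"{rrr:.0%}")  # 0.55 45%
```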

Reliability (consistency): The degree to which a measurement is consistent; that is, repeated measures of the same property on the same experimental unit give the same value. Various means are used to estimate reliability depending on the type of measurement, including Cronbach's alpha and the correlation coefficient.

Repeated measures: The situation where a group of cases is measured on more than one occasion over time, eg, prior to and following an intervention, or on the same observational unit, such as a family or medical practice.

Representative sample: A sample that accurately reflects the characteristics of the population from which it is drawn; sometimes termed an "unbiased" sample.

Residual: In regression analysis, the difference between an observed value and a predicted value.

Residual confounding: In a regression model, the residual unexplained variance that remains after adjustment for confounders; that is, it represents inadequate control of confounding. It can occur because of failure to consider some confounders, inaccurate measurement of exposure (such as self-reports of tobacco use), insufficient data in some of the strata of confounders being adjusted for, and other violations of the model's assumptions.

Residual standard deviation: In regression analysis, the square root of the error mean square; a measure of variability of the data. Also called the root mean square.

Retrospective: A study conducted after data have already been collected, often for another purpose. Specific types include case-control or chart studies and retrospective cohort studies.

Risk: The probability that an (often unfavorable) event, such as death or illness, will occur within a stated period of time.

Risk factor: A personal characteristic or exposure that is associated with the occurrence of disease.

Sample: A subset of observations from a parent population. The usual goal is to select a sample that is representative of the parent population and to use it to provide information about the population's distribution. Measurements are made on each unit in the sample, and sample statistics are calculated that are used to infer the values of the parent population's parameters. The unit in the sample can vary, for example, from measurements of properties of an individual to a sample of the effects of a treatment. For example, the difference in attachment level of ten patients in a university clinic before and after surgery by a postgraduate student is a sample of the effects of all surgeries in that clinic, but the question arises of whether it would be representative of surgeries performed in specialist offices. Thus, readers of the literature must be careful in assessing how generally the findings of a study based on such samples can be extrapolated.


Sampling distribution: A statistic calculated from a sample of a population, such as a mean, has a theoretical sampling distribution, defined as the distribution of that statistic that would be produced by repeated random sampling from the same population. In constructing a sampling distribution, one assumes that an infinite number of samples of a given size have been drawn from the population and that the statistic of interest, for example the mean, has been calculated for each of these hypothetical samples; the distribution of the sample statistic (eg, the distribution of sample means) is its sampling distribution. The procedures of inferential statistics rest on comparing actual sample statistics with the theoretical sampling distribution to determine the probability that the sample statistic could have occurred by chance.

Sampling method: The method by which a sample is drawn from a population. Broadly, there are two approaches: (1) random, in which every case in the population has an equal chance of selection, and (2) non-random, in which cases have different chances of selection.

Scattergram: A graphic plotting of a sample on an xy-plane from a bivariate distribution (ie, one in which each individual or unit is associated with two values, an x value and a y value). Each pair of observations is plotted as a point; the points are plotted but not joined. Commonly, when a pattern is detected in the data, one goes on to calculate a correlation or perform a regression. For a correlation, it does not matter which variable is labeled x and which is labeled y, but for regression it is important that the dependent or predicted variable is labeled y, while the independent or predictor variable is always labeled x.
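The sampling distribution can be made concrete by simulation rather than theory. A minimal sketch assuming numpy is available; the population shape, sample size, and number of repeated samples are illustrative:

```python
# Minimal sketch: simulating the sampling distribution of the mean.
# Repeated samples of size 30 from a skewed population give an
# approximately normal distribution of sample means.
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=10, size=100_000)  # skewed parent population

sample_means = [rng.choice(population, size=30).mean() for _ in range(5_000)]

print(round(population.mean(), 1))      # ~10.0, the population mean
print(round(np.mean(sample_means), 1))  # ~10.0, sample means center on it
print(round(np.std(sample_means), 1))   # ~1.8, the standard error (10 / sqrt(30))
```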

Selection bias: Error due to systematic differences in characteristics between subjects (or papers) selected for study and those who are not selected.

Selection of subjects: The selection of subjects is a multi-level process involving the determination of (1) the target population, ie, the group to which one hopes to refer one's results; (2) the source population, which is accessible; (3) the eligible subjects, who meet the criteria for entry into the study; and (4) the study participants. Bias can operate at these different levels so that the population actually studied is not representative of the target population. For example, if a significant proportion of eligible subjects elect not to participate, the participants may include a higher proportion of compliant volunteers than is found in the target population.

Sensitivity analysis: A study of how the final outcome of an analysis changes as a function of varying one or more of the input parameters in a prescribed manner. A method often used in meta-analysis, economic evaluations, and decision analyses for assessing the effects of key assumptions or values on the final result. The assumptions are varied over their range of values to determine their effect on the result; large differences in effects indicate that the analysis is "sensitive" to the assumption. One-way sensitivity analysis varies one variable at a time, two-way sensitivity analysis varies two at a time, and so forth. In deterministic sensitivity analysis, variables are tested with point estimates; in probabilistic sensitivity analysis, variables are tested with ranges of estimates.

Spurious association: A statistical relationship in which the variables have no causal connection, for example, obesity and use of dental floss.

Standardization of data: Transformation of a variable's distribution so that it has mean 0 and standard deviation 1. The transformation is z = (x − µ)/σ on all scores in a set, where x is the particular score, µ is the population mean, and σ is the standard deviation. The transformed z score is a measure of the relative location of a score in its distribution. This property can be useful if one is comparing sets of scores from sources that have different averages or variances, for example, different examiners or different tests.
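A minimal sketch of the z transformation, with hypothetical examiner means and standard deviations:

```python
# Minimal sketch: standardizing scores from two examiners so they can be
# compared on a common scale, z = (x - mu) / sigma. Made-up numbers.

def z_score(x, mu, sigma):
    return (x - mu) / sigma

# Examiner A: mean 70, SD 5; Examiner B: mean 80, SD 10.
print(z_score(78, 70, 5))   # 1.6 -> well above Examiner A's average
print(z_score(88, 80, 10))  # 0.8 -> moderately above Examiner B's average
```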


Appendix 9

Guidance in Clinical Decision Making: Third Molars

Clinical Scenario

Management of impacted third molars

Clinical Guidance

We do not generally recommend the prophylactic extraction of asymptomatic third molars. We generally recommend the extraction of third molars for the following reasons:

1. History of oral/facial pain and/or swelling associated with third molars (ie, pericoronitis, cellulitis)
2. Unrestorable caries and/or endodontically involved teeth
3. Damage to adjacent teeth and structures
4. Evidence of cysts and tumors
5. Medical reasons arising from planned treatments such as orthodontics or a concern expressed by a healthcare professional

In general, the office recommendations are based on the current balance of evidence for and against third molar removal and employ the precautionary principle; that is, when there is a possibility of harm associated with a given treatment, it should not be given unless there is adequate scientific evidence justifying its use.

Background

Impacted third molars (ie, wisdom teeth) are not expected to erupt into occlusal function. They can be completely impacted, with only radiographic evidence of their presence, or partially erupted, with visual evidence of soft tissue coverage.1 Symptomatic impacted third molars typically present with the patient complaining of pressure, pain, or swelling. Asymptomatic impacted third molars are unerupted or partially erupted teeth with no history of patient complaints. However, asymptomatic impacted third molars do not imply the absence of pathology, as some third molars may have associated clinical signs of caries, periodontal disease, chronic abscess, or cyst or tumor formation.2

Etiology

Impaction of third molars is usually due to a lack of space or obstruction from adjacent teeth, which prevents the tooth from erupting into functional occlusion.

Prevalence

The prevalence of impacted third molars is reported to range from 18% to 68%.3 In adults between the ages of 20 and 30, the prevalence is likely around 50%.4 It is theorized that modern diet is partially responsible for the high occurrence of impacted third molars.4


Treatment Options

Therapy: Surgical removal

Complication                    Reported risk (%)            Reference
Overall                         4.6–31                       5
Dry socket                      0.4–26                       5
Infection                       0.8–4.2                      5
Bleeding                        0.2–5.8; 0.5–23; 0.4–6.4
Damage to the adjacent teeth    0.3–0.4                      5