Psychological Science Under Scrutiny: Recent Challenges and Proposed Solutions (ISBNs 9781118661086, 1118661087, 9781119095910)


English. Pages: 371 [400]. Year: 2017.




Table of contents :
Content: Introduction: psychological science under scrutiny : recent challenges and proposed remedies / Scott O. Lilienfeld and Irwin D. Waldman --
Cross-cutting challenges to psychological science --
Maximizing the reproducibility of your research / Open Science Collaboration --
Powering reproducible research / Katherine S. Button and Marcus R. Munafò --
Psychological science's aversion to the null and why many of the things you think are true, aren't / Christopher J. Ferguson and Moritz Heene --
False negatives / Klaus Fiedler and Malte Schott --
Toward transparent reporting of psychological science / Etienne P. LeBel and Leslie K. John --
Decline effects : types, mechanisms, and personal reflections / John Protzko and Jonathan W. Schooler --
Reverse inference / Joachim I. Krueger --
The need for bayesian hypothesis testing in psychological science / Eric-Jan Wagenmakers, Josine Verhagen, Alexander Ly, Dora Matzke, Helen Steingroever, Jeff N. Rouder, and Richard Morey --
Domain-specific challenges to psychological science --
The (partial but) real crisis in social psychology : a social influence analysis of the causes and solutions / Anthony R. Pratkanis --
Popularity as a poor proxy for utility : the case of implicit prejudice / Gregory Mitchell and Philip E. Tetlock --
Suspiciously high correlations in brain imaging research / Edward Vul and Harold Pashler --
Critical issues in genetic association studies / Elizabeth Prom-Wormley, Amy Adkins, Irwin D. Waldman, Danielle Dick --
Is the efficacy of "antidepressant" medications overrated? / Brett J. Deacon and Glen I. Spielmans --
Pitfalls in parapsychological research / Ray Hyman --
Psychological and institutional obstacles to high quality psychological science --
Blind analysis as a correction for confirmatory bias in physics and psychology / Robert J. MacCoun and Saul Perlmutter --
Allegiance effects in clinical psychology research and practice / Marcus T. Boccaccini, David Marcus, and Daniel C. Murrie --
We can do better than fads / Robert J. Sternberg --
Afterword / Paul Bloom --
Index.


Psychological Science Under Scrutiny

Psychological Science Under Scrutiny Recent Challenges and Proposed Solutions Edited by

Scott O. Lilienfeld Irwin D. Waldman

This edition first published 2017 © 2017 John Wiley & Sons, Inc Registered Office John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK Editorial Offices 350 Main Street, Malden, MA 02148‐5020, USA 9600 Garsington Road, Oxford, OX4 2DQ, UK The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK For details of our global editorial offices, for customer services, and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com/wiley‐blackwell. The right of Scott O. Lilienfeld and Irwin D. Waldman to be identified as the authors of the editorial material in this work has been asserted in accordance with the UK Copyright, Designs and Patents Act 1988. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books. Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. Limit of Liability/Disclaimer of Warranty: While the publisher and authors have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought. Library of Congress Cataloging‐in‐Publication Data Names: Lilienfeld, Scott O., 1960– editor. | Waldman, Irwin D., editor. Title: Psychological science under scrutiny / edited by Scott O. Lilienfeld, Irwin D. Waldman. Description: Hoboken : Wiley, [2017] | Includes bibliographical references and index. Identifiers: LCCN 2016031603 (print) | LCCN 2016036478 (ebook) | ISBN 9781118661079 (pbk.) | ISBN 9781118661086 (pdf) | ISBN 9781118661048 (epub) Subjects: LCSH: Psychology. Classification: LCC BF121 .P76217 2017 (print) | LCC BF121 (ebook) | DDC 150.72–dc23 LC record available at https://lccn.loc.gov/2016031603 A catalogue record for this book is available from the British Library. Cover image: © Takashi Kitajima/Getty Images, Inc. Set in 10.5/13pt Minion by SPi Global, Pondicherry, India

10 9 8 7 6 5 4 3 2 1

Contents

List of Contributors
Introduction: Psychological Science in Perspective

Part I  Cross-Cutting Challenges to Psychological Science
1  Maximizing the Reproducibility of Your Research / Open Science Collaboration
2  Powering Reproducible Research / Katherine S. Button and Marcus R. Munafò
3  Psychological Science's Aversion to the Null, and Why Many of the Things You Think Are True, Aren't / Moritz Heene and Christopher J. Ferguson
4  False Negatives / Klaus Fiedler and Malte Schott
5  Toward Transparent Reporting of Psychological Science / Etienne P. LeBel and Leslie K. John
6  Decline Effects: Types, Mechanisms, and Personal Reflections / John Protzko and Jonathan W. Schooler
7  Reverse Inference / Joachim I. Krueger
8  The Need for Bayesian Hypothesis Testing in Psychological Science / Eric-Jan Wagenmakers, Josine Verhagen, Alexander Ly, Dora Matzke, Helen Steingroever, Jeffrey N. Rouder, and Richard D. Morey

Part II  Domain-Specific Challenges to Psychological Science
9  The (Partial but) Real Crisis in Social Psychology: A Social Influence Analysis of the Causes and Solutions / Anthony R. Pratkanis
10  Popularity as a Poor Proxy for Utility: The Case of Implicit Prejudice / Gregory Mitchell and Philip E. Tetlock
11  Suspiciously High Correlations in Brain Imaging Research / Edward Vul and Harold Pashler
12  Critical Issues in Genetic Association Studies / Elizabeth Prom-Wormley, Amy Adkins, Irwin D. Waldman, and Danielle Dick
13  Is the Efficacy of "Antidepressant" Medications Overrated? / Brett J. Deacon and Glen I. Spielmans
14  Pitfalls in Parapsychological Research / Ray Hyman

Part III  Psychological and Institutional Obstacles to High-Quality Psychological Science
15  Blind Analysis as a Correction for Confirmatory Bias in Physics and in Psychology / Robert J. MacCoun and Saul Perlmutter
16  Allegiance Effects in Clinical Psychology Research and Practice / Marcus T. Boccaccini, David Marcus, and Daniel C. Murrie
17  We Can Do Better than Fads / Robert J. Sternberg

Afterword: Crisis? What Crisis? / Paul Bloom

Index

List of Contributors

Amy Adkins, Virginia Commonwealth University
Paul Bloom, Yale University
Marcus T. Boccaccini, Sam Houston State University
Katherine S. Button, University of Bath
Brett J. Deacon, University of Wollongong, School of Psychology, New South Wales, Australia
Danielle Dick, Virginia Commonwealth University
Christopher J. Ferguson, Stetson University
Klaus Fiedler, University of Heidelberg
Moritz Heene, Ludwig Maximilian University of Munich
Ray Hyman, University of Oregon
Leslie K. John, Harvard University
Joachim I. Krueger, Brown University
Etienne P. LeBel, Montclair State University
Scott O. Lilienfeld, Emory University
Alexander Ly, University of Amsterdam
Robert J. MacCoun, Stanford University
David Marcus, Washington State University
Dora Matzke, University of Amsterdam
Gregory Mitchell, University of Virginia
Richard D. Morey, University of Groningen
Marcus R. Munafò, MRC Integrative Epidemiology Unit, UK Centre for Tobacco and Alcohol Studies, and School of Experimental Psychology, University of Bristol, United Kingdom
Daniel C. Murrie, University of Virginia
Harold Pashler, UCSD Psychology
Saul Perlmutter, University of California at Berkeley
Anthony R. Pratkanis, University of California, Santa Cruz
Elizabeth Prom-Wormley, Virginia Commonwealth University
John Protzko, University of California, Santa Barbara
Jeffrey N. Rouder, University of Missouri
Jonathan W. Schooler, University of California, Santa Barbara
Malte Schott, University of Heidelberg
Glen I. Spielmans, Metropolitan State University, Department of Psychology, Minnesota; University of Wisconsin, Department of Counseling Psychology, Wisconsin
Helen Steingroever, University of Amsterdam
Robert J. Sternberg, Cornell University, Ithaca, New York
Philip E. Tetlock, University of Pennsylvania
Josine Verhagen, University of Amsterdam
Edward Vul, UCSD Psychology
Eric-Jan Wagenmakers, University of Amsterdam
Irwin D. Waldman, Emory University

Introduction

Psychological Science in Perspective
Scott O. Lilienfeld and Irwin D. Waldman

The essence of science, including psychological science, is ruthless and relentless self‐criticism. At its best, psychological science subjects cherished claims to searching scrutiny. Conclusions that survive close examination are provisionally retained; those that do not are modified or jettisoned. In this way, psychological science, like other sciences, is ultimately self‐correcting and progressive. Some authors (e.g., Berezow, 2015; Hartsfield, 2015) have questioned this upbeat appraisal, and have argued without qualification that psychology is not a science. This pronouncement neglects the crucial point that science is not a body of knowledge; it is an approach to acquiring and evaluating knowledge. Specifically, it is an approach that strives to reduce error by implementing methodological safeguards, such as randomization to experimental conditions, the use of blinded observations, and sophisticated statistical analyses, thereby yielding a closer approximation to reality (Lilienfeld, 2010). By these standards, much of contemporary psychology is every bit as scientific as traditional “hard” sciences, such as chemistry and physics. By availing itself of these bulwarks against error, psychological science has been quite successful across myriad domains. Moreover, it has spawned numerous discoveries of both theoretical and practical importance. To take merely a handful of salient examples, psychological science has helped us to better understand the basic mechanisms of learning, the nature of memory, the structure of emotion, the nature of individual differences in cognitive abilities, and the correlates and causes of many mental disorders (Hunt, 2009). Moreover, some psychological findings, such as classical conditioning, visual afterimages, the serial position effect in memory, the impact of peers on conformity, and the effects of prolonged exposure on pathological anxiety, are just as replicable as those in the hard sciences (Meehl, 1986). Psychological science has also borne fruit in such real‐world applications as aptitude testing, political polling, behavioral medicine, advertising, eyewitness testimony, the design of airplane cockpits, automobile safety, techniques for teaching language to children with intellectual disability, the reduction of prejudice in classrooms, and
evidence‐based psychotherapies that have alleviated the suffering of tens of thousands of individuals with mood, anxiety, eating, sleep, and substance disorders (Zimbardo, 2004). There is ample reason to be proud of psychological science and its accomplishments.

Psychological Science Under Scrutiny

Nonetheless, over the past decade, and the past several years in particular, the prevailing narrative of psychological science as a progressive discipline that is characterized by replicable findings has been cast into serious doubt (Yong, 2012). More broadly, the commonly accepted methodologies of psychological science have come under withering attack from both within and outside the profession. Many of the pointed challenges posed by critics have given pause to psychological researchers across a broad spectrum of subfields, including experimental social psychology, cognitive psychology, functional brain imaging, molecular behavioral and psychiatric genetics, the validity of projective techniques (such as the Rorschach inkblot test), psychotherapy outcome research, and eyewitness memory. These scholars have argued that psychological findings are considerably less trustworthy than many of us, the two editors of this book included, have long presumed. This edited book imparts the story of these recent critical appraisals of "business as usual" across a broad spectrum of domains of psychological science. It also describes what our field has learned from these critiques, and how psychological science can improve in response to them.

Much of the impetus behind these recent criticisms stems from the influential work of Stanford University epidemiologist John Ioannidis, whose 2005 article, "Why most published research findings are false," has engendered considerable self-reflection in medicine and related fields (as of this writing, this article has been cited over 4,000 times by other scholars). According to Ioannidis's (2005) eye-opening analysis, approximately 40% of published findings in medicine are incorrect or substantially overestimated in magnitude. Whether this percentage is itself overestimated remains a lively topic of debate (e.g., Goodman & Greenland, 2007), but there can be little doubt that many widely ballyhooed medical findings may be less robust than commonly assumed. As one striking example, when the biotechnology firm Amgen recently attempted to replicate 53 "landmark" published findings on cancer treatment, it failed in 47 cases (Begley & Ellis, 2012).

Although Ioannidis and several other authors have directed their broadsides toward medicine and science more generally, most critics have pointed the quills of their arrows increasingly at psychological science. Indeed, in the pages of our field's most prestigious journals, including Psychological Science, Perspectives on Psychological Science, Psychological Methods, American Psychologist, and the Journal of Personality and Social Psychology, scholars across diverse subdisciplines have maintained that the standard approaches adopted in published psychological investigations tend to yield a disconcertingly large number
of false positive findings (e.g., Pashler & Wagenmakers, 2012). Among other things, these researchers have observed that psychological investigators sometimes confuse exploratory (hypothesis generation) with confirmatory (hypothesis testing) modes of data analysis, thereby inflating the risk of erroneous conclusions. Exploratory data analysis, although enormously useful for certain purposes (Tukey, 1977), can lend itself to a host of abuses. In particular, critics have raised legitimate concerns regarding HARKing (hypothesizing after results are known), which refers to the tendency to portray post‐hoc conclusions as a priori hypotheses (Kerr, 1998), and p‐hacking, which refers to a family of practices that can cause findings that were initially statistically nonsignificant to dip below the threshold of statistical significance (typically below the standard p = 0.05 threshold; Lindsay, 2015; Simonsohn, Nelson, & Simmons, 2014). These worrisome but often largely overlooked practices are both prevalent and detrimental to the progress of psychological science. p‐hacking practices include exclusion of outliers, transformation of distributions, combining one or more subgroups, “cherry‐picking” of positive findings within studies (more technically termed outcome reporting bias; Chan, Krleža‐Jerić, Schmid, & Altman, 2004), splitting analyses by demographic groups (e.g., males versus females), and repeatedly commencing and halting data collection until significance level drops below the p = 0.05 level (optional starting and stopping points; Gilovich, 1991). Some of these practices, such as excluding outliers or transforming distributions, are often entirely appropriate in exploratory research, as they can point investigators toward fruitful questions to be pursued in future research. Nevertheless, these practices can become exceedingly problematic when they are conducted on a post‐hoc basis but are reported in published articles as though they were planned. Needless to say, p‐hacking can result in pronounced overestimates of the prevalence of statistically significant effects in given fields, as well as substantially inflated estimates of the average effect size in these fields. p‐hacking practices within psychology and allied disciplines may also help to account for the curious finding that the proportion of positive findings in psychology and psychiatry – approximately 90% – apparently exceeds that in all other domains of science (Fanelli, 2010). Indeed, given that the average statistical power of studies in psychology is low – just over 40% by some estimates (Cohen, 1962; Rossi, 1990; Sedlmeier & Gigerenzer, 1989) – this remarkably high percentage is almost surely “too good to be true.” That is, the proportion of statistically significant findings in psychology appears to be considerably larger than would be expected given the modal statistical power of investigations, again raising the specter of false positive findings. Over the past several years, considerable attention has also been accorded to the decline effect (the “law of initial results”), the apparent tendency of effects reported by investigators in initial studies of a given phenomenon to diminish or even ­disappear over time (Schooler, 2011). For example, some evidence suggests that the well‐known “bystander nonintervention” effect, whereby people are less likely to intervene in emergencies when others are present, has mysteriously shrunk in ­magnitude over the past few decades (Fischer et al., 2011). Recent findings similarly
raise the possibility that the efficacy of cognitive behavioral therapy for major depressive disorder has been decreasing over time (Johnsen & Friborg, 2015). If the decline effect is pervasive in psychology, it would imply that many well‐accepted psychological findings and conclusions are likely to decay.
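The inflation produced by the p-hacking practices described above, particularly optional stopping, is easy to demonstrate. The short simulation below is our illustration rather than anything drawn from the chapters themselves; it assumes Python with NumPy and SciPy installed, and its sample sizes and peeking schedule are arbitrary choices. Both groups are drawn from the same population, so every "significant" result is a false positive, yet checking the p-value after each batch of participants and stopping as soon as p < .05 pushes the long-run false positive rate well above the nominal 5%.

```python
# Minimal sketch (ours, not from the book): how optional stopping inflates
# the false positive rate when the null hypothesis is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

def false_positive_rate(n_start=20, n_max=100, step=10, alpha=0.05, n_sims=2000):
    """Proportion of null-effect experiments declared 'significant' when the
    analyst tests after every batch of participants and stops once p < alpha."""
    hits = 0
    for _ in range(n_sims):
        a = list(rng.normal(size=n_start))  # both groups come from the same population
        b = list(rng.normal(size=n_start))
        while True:
            p = stats.ttest_ind(a, b).pvalue
            if p < alpha:        # stop and "publish" as soon as p dips below alpha
                hits += 1
                break
            if len(a) >= n_max:  # give up at the planned maximum sample size
                break
            a.extend(rng.normal(size=step))
            b.extend(rng.normal(size=step))
    return hits / n_sims

print("Nominal false positive rate for a single fixed-n test: 0.05")
print(f"False positive rate with optional stopping: {false_positive_rate():.2f}")
```

On typical runs the optional-stopping rate comes out several times larger than the nominal 5%, which is part of why a roughly 90% positive-result rate in a field with modest statistical power looks too good to be true.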

The Replication Crisis

It can be devilishly difficult to ascertain whether a statistically significant result is genuine. As a consequence, readers of the psychological literature are often left to wonder which results to trust and which to ignore. Perhaps the most crucial criterion for evaluating the robustness of psychological findings is replication, especially when conducted by independent investigative teams (Lindsay, 2015; Lykken, 1968). As the philosopher of science Sir Karl Popper (1959) observed, "non-reproducible single occurrences are of no significance to science" (p. 66). Over the past decade, numerous scholars have raised concerns regarding the replicability of psychological findings, with many referring to the present state of tumult as the "replication crisis" (Bartlett, 2014). Admittedly, the term "crisis" may be an overstatement, as the magnitude of the replicability problem across domains of psychology is largely unknown. At the same time, it has become increasingly evident that psychologists can no longer take the replicability of their findings for granted (Asendorpf et al., 2013).

Because they are rarely perceived as "exciting" or "sexy," replication efforts have been greatly undervalued in most domains of psychology. Furthermore, many premier psychology journals have been loath to publish either successful or unsuccessful replications, even though these studies should typically be accorded at least as much weight as the original investigations, and perhaps even more. Indeed, until relatively recently, some scholars doubted or even dismissed the importance of replication. For example, in the immediate wake of a controversial article by Bem (2011) purporting to uncover evidence for precognition, a form of extrasensory perception, the prominent journal that published the findings (the Journal of Personality and Social Psychology) initially refused to consider failed replications for potential publication (e.g., Ritchie, Wiseman, & French, 2012). Fortunately, this attitude has begun to recede, and the Journal of Personality and Social Psychology has since abandoned its "no replication" policy in response to a flurry of criticism.

Still, as psychologists have belatedly begun to take stock of the replicability of their findings, their efforts have met with an unwelcome surprise: Many of their studies are considerably less replicable than was initially assumed. For example, a widely publicized collaborative effort by the Open Science Collaboration (2015) to directly replicate 100 published findings in social and cognitive psychology revealed that only 39% of the studies were subjectively rated as having replicated the original results (but see Gilbert, King, Pettigrew, & Wilson, 2016, for an alternative view). As the authors themselves wisely noted, these replication failures do not necessarily imply that the original findings were false; moreover, as they observed, there
is no single metric for gauging replication success (although in their analyses, ­replicability was relatively low, regardless of which metric they used). Nevertheless, these sobering findings suggest that psychologists and consumers of psychological research, including the news media, should place considerably less weight than they currently do on unreplicated findings (Waldman & Lilienfeld, 2015). Controversies surrounding the insufficient attention accorded to replication have recently spilled over into lively and at times contentious blog discussions in social psychology, where widely cited research on the effects of priming on nonverbal behavior, such as that conducted by Yale psychologist John Bargh and others (see Bargh, Chen, & Burrows, 1996), has not withstood scrutiny by independent investigators (Doyen, Klein, Pichon, & Cleeremans, 2012). Although some psychological researchers have dismissed large‐scale replication efforts as possessing little or no scientific value (e.g., Mitchell, 2014), this defensive reaction is unwarranted. Only by ascertaining whether their findings survive multiple direct replication efforts can psychologists hope to ascertain their veracity (Simons, 2014). The principal takeaway lesson from the recent debates is not that most psychological findings are unreplicable. Rather, it is that we need to fundamentally change the way we think about psychological data and results. Rather than conceptualizing each new study as a source of settled conclusions, we need to conceptualize its findings as merely one data point in a large population of potential studies, many or most of which have yet to be conducted (Waldman & Lilienfeld, 2015). We need to think meta‐analytically, even when we are not conducting formal meta‐analyses. Another crucial bottom‐line lesson is that higher statistical power is necessary to boost the replicability of psychological science (Asendorpf et  al., 2013; Tajika, Ogawa, Takeshima, Hayasaka, & Furukawa, 2015). Prominent statistically oriented psychologists have long lamented the low statistical power of most studies in their discipline (Cohen, 1962), but to little avail (Sedlmeier & Gigerenzer, 1989). Virtually all psychological researchers recognize that low statistical power is tied to a higher likelihood of false negative results. Unfortunately, many of these same researchers also erroneously assume that if a finding is statistically significant even with a small sample size, it is especially likely to be robust and replicable (indeed, we continue to hear this view espoused by a number of our academic colleagues). In fact, the opposite is true (Button et al., 2013; Walum, Waldman, & Young, 2015). Because of a statistical phenomenon known as winner’s curse, results from underpowered studies that manage to attain statistical significance are less likely to be genuine because their effects must be overestimated (i.e., positively biased) in order to achieve statistical significance. Moreover, even when genuine, their effects sizes are likely to be overestimated. The more underpowered the study, the greater the likelihood of such bias. Psychology is hardly alone in its replicability challenges, as the Amgen episode we have already mentioned amply underscores. There is little evidence that replicability is substantially lower in subfields of psychology than it is in other scientific domains, including particle physics (Hedges, 1987; cf. Hartsfield, 2015). This point appears to have been lost on a number of observers. 
For example, in response to the
publication of the recent Open Science Collaboration (2015) replicability findings, a recent president of the American Psychiatric Association and influential author, Jeffrey Lieberman, tweeted that psychology is “in shambles” (see McKay & Lilienfeld, 2015). Ironically, soon after this tweet appeared, an article reporting comparable replicability problems in psychiatry appeared in print (Tajika et al., 2015). In their review, the authors examined 83 widely cited articles in psychiatry journals that had reported results for specific interventions. Of the studies reported therein, 40 had never been subjected to replication attempts, 11 were contradicted by later findings, and 16 reported substantially smaller effect sizes than in the original study; only 16 of the original studies were successfully replicated. Clearly, replicability is a concern for science at large, not merely psychological science.
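The winner's curse discussed above can likewise be made concrete with a brief simulation. The sketch below is ours, not the editors'; it assumes NumPy and SciPy, and the true effect size (d = 0.3) and the sample sizes are illustrative values only. Among studies of a small true effect, the subset that happens to cross the p < .05 threshold reports systematically inflated effect sizes, and the inflation shrinks as sample size, and hence statistical power, increases.

```python
# Minimal sketch (ours): the winner's curse in underpowered studies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
TRUE_D = 0.3   # assumed (small) true standardized effect size
ALPHA = 0.05

def simulate(n_per_group, n_sims=5000):
    """Return empirical power and the mean observed Cohen's d among significant results."""
    sig_d = []
    n_sig = 0
    for _ in range(n_sims):
        treatment = rng.normal(loc=TRUE_D, size=n_per_group)
        control = rng.normal(loc=0.0, size=n_per_group)
        pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
        d = (treatment.mean() - control.mean()) / pooled_sd
        if stats.ttest_ind(treatment, control).pvalue < ALPHA:
            n_sig += 1
            sig_d.append(d)
    return n_sig / n_sims, float(np.mean(sig_d))

for n in (20, 80, 320):
    power, mean_sig_d = simulate(n)
    print(f"n = {n:3d} per group | power ~ {power:.2f} | "
          f"mean significant d ~ {mean_sig_d:.2f} (true d = {TRUE_D})")
```

At the smallest sample size, the average "significant" effect is roughly double the true value or more; at the largest, it sits close to the true d of 0.3, illustrating why well-powered studies yield more trustworthy, and more replicable, effect size estimates.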

Other Challenges to Psychological Science

The challenges to psychological science do not end there. A growing cadre of scholars has argued that the "file drawer problem," the tendency of negative studies to remain selectively unpublished (Rosenthal, 1979), poses a serious threat to the integrity of conclusions in psychology and other sciences (Franco, Malhotra, & Simonovits, 2014). Such publication bias may exacerbate the problem of false positives generated by HARKing, p-hacking, and other problematic research practices. Although a host of helpful statistical approaches, such as funnel plots of effect sizes (Duval & Tweedie, 2000), exist for estimating the impact of publication biases on psychological conclusions, none is free of limitations. To address the file drawer problem and other forms of publication bias (e.g., outcome reporting bias), a number of researchers have proposed that the raw data from published psychological studies be placed in publicly available registries for re-analyses by independent scholars (e.g., Asendorpf et al., 2013; Ioannidis, Munafo, Fusar-Poli, Nosek, & David, 2014). Many of these researchers have further suggested that investigators' hypotheses be pre-registered, thereby minimizing the likelihood of outcome reporting bias. Nevertheless, these proposed remedies have met with vocal opposition in some quarters.

In other cases, critics have contended that psychological researchers frequently neglect to account for the a priori plausibility of their theories when appraising their likelihood. According to these critics, investigations in certain domains, such as parapsychology (the study of extrasensory perception and related paranormal phenomena), should be held to much higher evidentiary standards than those in other fields, because the claims advanced by researchers in the former fields run strongly counter to well-established scientific conclusions. Many of these critics have lobbied for a heightened emphasis on "Bayesian" approaches to data analysis, which consider the initial scientific plausibility of findings when evaluating their probability (Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011). In addition, over the past decade or so, a growing chorus of scholars has insisted that many well-accepted psychological and psychiatric findings, such as those concerning stereotype threat, implicit prejudice, unconscious priming,
psychopharmacology (e.g., the efficacy of antidepressant medications relative to placebos), and psychotherapy outcome research, have been substantially overhyped. For example, in the domain of psychotherapy research, some meta-analyses suggest that a substantial proportion of the variability in client outcomes is attributable to allegiance effects – that is, the extent to which investigators conducting the studies are partial to the intervention in question (Luborsky et al., 1999). Finally, over the past few years, several high-profile examples of definitive or probable data fabrication, falsification, and other questionable research practices (e.g., presenting exploratory analyses as confirmatory, omitting mention of relevant dependent variables that yielded nonsignificant findings) have raised troubling questions concerning psychology's capacity to police itself (John, Loewenstein, & Prelec, 2012). More recent evidence points to a nontrivial prevalence of statistical reporting errors in major psychological journals (Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts, 2015). Perhaps not surprisingly, the distribution of these errors appears to be nonrandom, and conveniently tends to favor the authors' hypotheses.
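The case for weighing a priori plausibility can be put in simple quantitative terms. The sketch below is our illustration; the prior probabilities and power values are assumptions chosen for exposition, not estimates from any chapter. It computes the positive predictive value of a "significant" result, that is, the probability that a finding surviving the p < .05 filter reflects a real effect, under Ioannidis-style assumptions: when the hypothesis under test is a long shot, as critics argue is the case in parapsychology, even a significant p-value leaves the claim more likely false than true, and low power makes matters worse.

```python
def ppv(prior, power, alpha=0.05):
    """Positive predictive value: P(effect is real | p < alpha),
    ignoring bias, multiple testing, and questionable research practices."""
    true_positives = power * prior         # real effects that are detected
    false_positives = alpha * (1 - prior)  # null effects that slip through at rate alpha
    return true_positives / (true_positives + false_positives)

# Illustrative priors: a well-grounded theory, a speculative hypothesis,
# and a long-shot claim that contradicts established science.
for prior in (0.5, 0.1, 0.01):
    for power in (0.8, 0.4):
        print(f"prior = {prior:<4} power = {power} -> PPV = {ppv(prior, power):.2f}")
```

Under the last row's assumptions (a 1% prior and 40% power), fewer than 8% of "significant" findings would reflect real effects, which is the arithmetic behind the Bayesian insistence that extraordinary claims require correspondingly strong evidence.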

Media Attention

Many of the recent challenges to psychological science have begun to catch the public eye. In a widely discussed 2011 article in the New York Times ("Fraud Case Seen as Red Flag for Psychology Research"), science journalist Benedict Carey discussed the shocking case of eminent Dutch social psychologist and journal editor Diederik Stapel, much of whose research was discovered to have been blatantly and brazenly fabricated. Carey (2011) wrote that:

Experts say the case exposes deep flaws in the way science is done in a field, psychology, that has only recently earned a fragile respectability. … The scandal, involving about a decade of work, is the latest in a string of embarrassments in a field that critics and statisticians say badly needs to overhaul how it treats research results. In recent years, psychologists have reported a raft of findings on race biases, brain imaging and even extrasensory perception that have not stood up to scrutiny.

Only the year before, now-disgraced journalist Jonah Lehrer penned a provocative piece in The New Yorker magazine on the decline effect entitled "The Truth Wears Off: Is There Something Wrong with the Scientific Method?" In this article, Lehrer (2010) highlighted the baffling tendency of findings in psychology and several other scientific fields to shrink in size over time. According to Lehrer, the decline effect may point to a fundamental flaw in how researchers in psychology and allied disciplines analyze and interpret data. Other articles questioning the standard methods of psychology (in such subdisciplines as neuroimaging and parapsychology) have recently appeared in such high-profile outlets as the Atlantic, Newsweek, Scientific American, and Seed magazines. In 2015, the results of the Open Science Collaboration, which as noted earlier
revealed that the lion’s share of 100 published studies on social and cognitive psychology did not survive attempts at direct replication, received prominent coverage in the New York Times, Wall Street Journal, and other major venues. Finally, controversies regarding the potential overhyping of psychiatric medications, stemming in part from the alleged suppression of negative findings by drug companies, have received national coverage on 60 Minutes and other media outlets. Needless to say, this media coverage has not only given psychology and psychiatry something of a “black eye” in the view of much of the general public (Ferguson, 2015), but has led many scholars inside and outside of psychology to ask whether the status quo methodologies of our field are ripe for reexamination.

Scrutiny of Psychological Science as Healthy Self-Criticism

In contrast to many outspoken critics, we regard many or most of the recent questions raised about the modal methodological approaches of psychological science as signs of the health of our discipline. Some readers may find this statement to be surprising. Yet, in many respects, the recent scrutiny accorded to psychological science by psychological scientists themselves exemplifies science working precisely as it should – subjecting claims to intense criticism in a concerted effort to winnow out errors in one's web of beliefs (O'Donohue, Lilienfeld, & Fowler, 2007). Far from psychology being in shambles, our field is leading the way to improving not only the conduct of psychological science, but of science itself.

We suspect that some of the recent criticisms directed at psychological science will turn out to possess substantial merit, whereas others may not. Even so, many of these criticisms have posed important challenges to the status quo in our field, and raised thoughtful questions concerning long-held assumptions about how best to design studies, perform statistical analyses, and interpret findings (Ferguson, 2015). They have also underscored the field's insufficient emphasis on systematic safeguards against spurious findings and independent replication, and its frequent overemphasis on results that are surprising, splashy, or "sexy." Indeed, we are unreserved optimists regarding the future of psychological science. We view these recent criticisms not as threats, but rather as opportunities to identify and minimize heretofore underappreciated sources of error in psychology's findings and conclusions. Indeed, to the extent that these criticisms can point psychologists, psychiatrists, and others toward novel methodologies to root out and eliminate these sources of error, they may ultimately prove to be psychological science's best allies.

This Book

In the spirit of subjecting claims to rigorous scrutiny, in this edited book we "turn the tables" and place psychological science itself under the microscope. In this book, we explore a variety of recent challenges to the standard methodologies and
assumptions of psychological science. Just as important, we examine the advantages and disadvantages of proposed remedies for these challenges. In this way, we intend to stimulate constructive debate regarding how to enhance the trustworthiness of psychology’s findings and conclusions, and ultimately to make psychology more firmly grounded in solid science. The chapters are authored by internationally recognized experts in their fields, and written with a minimum of specialized terminology. In addition, each chapter lays out not only the pertinent challenges posed to psychological science, but proposed solutions and, when relevant, constructive suggestions for future research. This book should be of considerable interest to researchers, teachers, advanced undergraduates, and graduate students across all domains of psychology, as well as to those in allied fields, such as psychiatry, psychiatric nursing, counseling, and social work. In addition, this book should be relevant to investigators in sociology, anthropology, neuroscience, medicine, public health, and epidemiology, all of which rely on at least some of the same methodologies as psychological researchers. Finally, this book should appeal to instructors who teach graduate and advanced undergraduate courses in psychological research methods and statistics, some of whom may elect to adopt the book as a supplemental text.

Format and Table of Contents

This book is divided into three major sections: (I) Cross-Cutting Challenges to Psychological Science; (II) Domain-Specific Challenges to Psychological Science; and (III) Psychological and Institutional Obstacles to High-Quality Psychological Science. To set the stage for what is to come, we summarize these three sections, as well as the chapters within each section, in the text that follows.

Section I, Cross-Cutting Challenges to Psychological Science, focuses on sweeping challenges to psychological science that cut across most or all subdisciplines, such as the problems posed by false positive findings, insufficient emphasis on replication, the decline effect, and the neglect of Bayesian considerations in data evaluation.

●  In Chapter 1, Brian Nosek and his colleagues at the Open Science Collaboration survey the landscape of the replicability challenges confronting psychology, and present a number of potential remedies for enhancing the reproducibility of psychological research. Among other suggestions, they underscore the value of pre-registering studies and study designs, performing confirmatory analyses prior to moving on to exploratory analyses, and sharing all study materials with one's collaborators and the broader scientific community.
●  In Chapter 2, Katherine S. Button and Marcus R. Munafò discuss the problem of low statistical power in psychological research and offer a user-friendly tutorial on its impact on the detection of genuine effects. They discuss several potential reasons for the persistence of underpowered psychological research and examine several solutions to this lingering problem, including a heightened emphasis on large-scale collaborative research and online data collection.
●  In Chapter 3, Christopher J. Ferguson and Moritz Heene argue that contemporary psychological research is marked by "an aversion to the null hypothesis," that is, a reluctance to publish negative findings. They contend that this bias has resulted in a distorted picture of the magnitudes of effects across many psychological domains. The solutions, they maintain, include a greater willingness to publish replications, including unsuccessful replications, and a sea change in the academic culture, in which negative findings and conclusions are accorded greater value.
●  Most of our discussion thus far has focused on false positives. Yet, as Klaus Fiedler and Malte Schott observe in Chapter 4, false negatives also pose a serious – and often insufficiently appreciated – challenge to psychological research. The authors discuss the statistical sources of false negative results and consider the relative costs of false positive and false negative findings in psychological research. They point out that although false positive results can often be ferreted out by means of subsequent unsuccessful replication efforts, false negative results may lead to fruitful lines of research being prematurely abandoned. Hence, in some domains of basic psychological science, they may be even more detrimental than false positives.
●  In Chapter 5, Etienne P. LeBel and Leslie K. John address the problems posed by the lack of transparency in the reporting of psychological research. As they note, the veil of secrecy of which authors often avail themselves can allow them to engage with impunity in a host of questionable practices that can boost the odds of false positive results. The authors offer a number of suggestions for increasing transparency in data reporting, including public platforms for disclosing data and data analyses, and changes in journal editorial policies.
●  Chapter 6, by John Protzko and Jonathan W. Schooler, explores the controversial topic of decline effects – the apparent decrease in the magnitude of effect sizes across numerous psychological domains. The authors present a novel taxonomy of decline effects and evaluate several potential reasons for the emergence of such effects, including regression to the mean and publication bias. As the authors point out, decline effects further underscore the importance of replication and meta-analyses as tools for winnowing out genuine from artifactual findings.
●  Chapter 7, authored by Joachim I. Krueger, examines the pervasive problem of reverse inference and its impact on the appraisal of psychological hypotheses and theories. Reverse inference occurs whenever we are reasoning backward from a behavior, thought, or emotion to a psychological state or trait. Psychologists engage in reverse inference whenever they posit a psychological state (e.g., fear) on the basis of activation of a specific brain region (e.g., amygdala); they also do so whenever they attempt to infer a psychological trait (e.g., extraversion) on the basis of endorsements of self-report items (e.g., "I enjoy going to parties"). As Krueger notes, reverse inferences are more ubiquitous in psychology than most people assume, and they come with unappreciated interpretative challenges.
●  Chapter 8, co-authored by Eric-Jan Wagenmakers, Josine Verhagen, Alexander Ly, Dora Matzke, Helen Steingroever, Jeff N. Rouder, and Richard Morey, raises questions regarding one of the sacred cows of psychological research: statistical significance testing. The authors contend that this commonly accepted approach is far too lenient, as it does not account for the a priori likelihood of hypotheses. They maintain that a Bayesian approach, although introducing an inherent level of subjectivity that many psychologists resist, provides a much better alternative to the standard appraisal of theories.

Section II, Domain-Specific Challenges to Psychological Science, focuses on challenges to psychological science that are specific to certain subdisciplines, such as functional neuroimaging, candidate gene studies of psychopathology, the efficacy of antidepressant medication, and parapsychological research.

●  In Chapter 9, Anthony R. Pratkanis delineates what he terms the "partial, but real crisis in social psychology," arguing that such distorting influences as intrinsic human biases, careerism, and the mounting pressures on researchers to generate "sexy" findings have damaged the scientific credibility of some subfields of contemporary social psychology. Pratkanis proposes a number of remedies to combat the recent ills afflicting social psychology, such as a focus on condition-seeking and a need to reform modal practices at psychological journals and granting agencies. He concludes by reminding us of Nobel Prize–winning physicist Richard Feynman's maxim that the essence of science is bending over backward to prove ourselves wrong.
●  In Chapter 10, Gregory Mitchell and Philip E. Tetlock take on one of the sacred cows of modern social psychology: implicit prejudice. They contend that much of the recent fascination with the conceptualization and measurement of implicit prejudice exemplifies the tendency of psychologically plausible claims to acquire a powerful foothold even in the absence of extensive supportive data. We suspect that not all readers will agree with Mitchell and Tetlock's conclusions, but also suspect that all readers will find their discussion to be provocative and enlightening.
●  In Chapter 11, Edward Vul and Harold Pashler expand on a now famous – some might say infamous – 2009 article in Perspectives on Psychological Science (by Vul, Harris, Winkielman, & Pashler; cited over 970 times as of this writing), which described the subtle methodological errors that can lead investigators to obtain remarkably high correlations (often above r = 0.90) between psychological states and traits, on the one hand, and brain activations, on the other. They provide readers with helpful suggestions for avoiding the "non-independence problem" they identified, and highlight the importance of increasing statistical power and focusing more squarely on replication. As Vul and Pashler wisely observe, these methodological pitfalls are not unique to brain imaging research, and appear in slightly different guises in the psychological literature, including personality assessment research.
●  Chapter 12, authored by Elizabeth Prom-Wormley, Amy Adkins, Irwin D. Waldman, and Danielle Dick, examines some of the reasons why genetic association studies, such as those in the domain of psychopathology, have often proven difficult to replicate. They distinguish the hyperbole that has often characterized these studies in the past from a more realistic contemporary appraisal, and offer a host of methodological desiderata for researchers and consumers of the literature. They conclude by discussing the promises and perils of widely hyped studies of gene-by-environment interaction.
●  In Chapter 13, Brett J. Deacon and Glen I. Spielmans address the contentious question of whether the efficacy of antidepressants and other psychotropic medications has been exaggerated. They contend that publication and outcome reporting biases, fueled by drug industry interests, have conspired to produce substantial overestimates of the efficacy of these medications, antidepressants in particular. Their chapter is a useful cautionary tale about the perils of seeking confirmation rather than refutation in applied science.
●  Chapter 14, by Ray Hyman, examines the numerous pitfalls that have afflicted the field of parapsychology, and discusses the quixotic search for "psi," a broad spectrum of paranormal phenomena that encompasses extrasensory perception and psychokinesis. As Hyman observes, Bem's (2011) widely ballyhooed article on precognition is only the most recent high-profile attempt to provide laboratory evidence for psi. Yet, as Hyman notes, all of these efforts have failed, despite more than 150 years of dedicated research. In addition, they have yielded a flurry of unreplicated positive findings, largely owing to an overreliance on statistical significance testing and the repeated insinuation of methodological flaws. Hyman's chapter should be enlightening even for readers without an interest in parapsychology per se, as it points to a host of subtle methodological flaws that can afflict most or all psychological research.

Section III, Psychological and Institutional Obstacles to High-Quality Psychological Science, focuses on psychological, sociological, and institutional obstacles that impede the progress of psychological science, including confirmation bias and preferences for "faddish" psychological questions.

●  One of the foremost psychological impediments standing in the way of scientific progress, including progress in psychology, is confirmation bias, also termed "confirmatory bias." This bias refers to a pervasive propensity to seek out and interpret evidence consistent with one's hypotheses and to neglect or selectively reinterpret evidence that is not (Nickerson, 1998). In Chapter 15, Robert J. MacCoun and Nobel Laureate Saul Perlmutter address the problem of confirmatory bias in psychology and allied fields. They introduce a novel technique, blind analysis, which has already been used to good effect in some domains of physics (see also MacCoun & Perlmutter, 2015), for combatting the insidious impact of confirmatory bias. Psychologists would be well advised to heed their methodological suggestions.
●  In Chapter 16, Marcus T. Boccaccini, David Marcus, and Daniel C. Murrie confront the thorny problem of allegiance effects in psychological research, with a particular focus on psychotherapy outcome research. As they note, evidence suggests that a disconcertingly large proportion of the variance in therapeutic outcomes appears to be attributable to investigators' allegiances to their favored treatments. This allegiance problem may extend well beyond psychotherapy to the testing of many psychological theories: Researchers, after all, are rarely entirely neutral parties, and frequently have deep-seated personal stakes in the outcomes of their studies. Boccaccini and colleagues conclude by presenting a number of potential remedies for minimizing the impact of allegiance biases, including adversarial collaborations, the use of theory-neutral investigators, and statistical procedures for adjusting for such biases.
●  In the book's final chapter, Chapter 17, Robert J. Sternberg discusses the dangers of pursuing fads in psychological research. He wisely observes that a slavish pursuit of "hot" topics, especially those that are eminently fundable, may hamper creative endeavors by encouraging institutional conformity. Moreover, what is hot today may be cold tomorrow. Sternberg concludes by outlining eight categories of creative research approaches that should enhance the progress and long-term health of psychological science.

Finally, the book concludes with an Afterword by Paul Bloom of Yale University, who places the recent challenges to psychological science in a broad historical and conceptual perspective, and who considers the implications of these challenges for the future of psychological research.

References Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J., Fiedler, K., … & Wicherts, J. M. (2013). Recommendations for increasing replicability in psychology. European Journal of Personality, 27, 108–119. Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior: Direct effects of trait construct and stereotype activation on action. Journal of Personality and Social Psychology, 71, 230–244. Bartlett, T. (2014). Replication crisis in psychology research turns ugly and odd. Chronicle of Higher Education. Retrieved from http://chronicle. com/article/Replication‐Crisis‐in/ 147301. Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483, 531–533. Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive ­influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. Berezow, A. K. (2015). Why psychology and statistics are not science. NewtonBlog. http:// www.realclearscience.com/blog/2015/11/why_psychology_and_statistics_are_not_ science.html



Introduction: Psychological Science in Perspective

xxiii

Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365–376. Carey, B. (2011, November 2). Fraud case seen as a red flag for psychology research. New York Times. Available at http://www.nytimes.com/2011/11/03/health/research/noted‐dutch‐ psychologist‐stapel‐accused‐of‐research‐fraud.html?_r=0 Chan, A. W., Krleža‐Jerić, K., Schmid, I., & Altman, D. G. (2004). Outcome reporting bias in randomized trials funded by the Canadian Institutes of Health Research. Canadian Medical Association Journal, 171, 735–740. Cohen, J. (1962). The statistical power of abnormal‐social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153. Doyen, S., Klein, O., Pichon, C. L., & Cleeremans, A. (2012). Behavioral priming: It’s all in the mind, but whose mind. PloS one, 7(1), e29081. Duval, S., & Tweedie, R. (2000). Trim and fill: a simple funnel‐plot–based method of testing and adjusting for publication bias in meta‐analysis. Biometrics, 56, 455–463. Fanelli, D. (2010). Positive results increase down the Hierarchy of the Sciences. PLoS ONE, 5(3). Ferguson, C. J. (2015). “Everybody knows psychology is not a real science:” Public perceptions of psychology and how we can improve our relationship with policymakers, the scientific community, and the general public. American Psychologist, 70, 527–542. Fischer, P., Krueger, J. I., Greitemeyer, T., Vogrincic, C., Kastenmüller, A., Frey, D., … & Kainbacher, M. (2011). The bystander‐effect: A meta‐analytic review on bystander intervention in dangerous and non‐dangerous emergencies. Psychological Bulletin, 137, 517–537. Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345, 1502–1505. Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). More on “Estimating the Reproducibility of Psychological Science.” Science, 351, 1037. Gilovich, T. (1991). How we know what isn’t so: The fallibility of human reason in everyday life. New York: Free Press. Goodman, S., & Greenland, S. (2007). Why most published research findings are false: Problems in the analysis. PLoS Medicine, 4(4), 773. Hartsfield, T. (2015). Statistics shows that psychology is not a science. NewtonBlog. http:// www.realclearscience.com/blog/2015/11/the_trouble_with_social_science_statistics. html Hedges, L. V. (1987). How hard is hard science, how soft is soft science? The empirical cumulativeness of research. American Psychologist, 42, 443–455. Hunt, M. (2009). The story of psychology. New York: Anchor. Ioannidis, J. P. (2005). Why most published research findings are false. Chance, 18, 40–47. Ioannidis, J. P., Munafo, M. R., Fusar‐Poli, P., Nosek, B. A., & David, S. P. (2014). Publication and other reporting biases in cognitive sciences: detection, prevalence, and prevention. Trends in Cognitive Sciences, 18, 235–241. John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532. Johnsen, T. J., & Friborg, O. (2015). The effectiveness of cognitive behavioral therapy as an  anti‐depressive treatment is falling: A meta‐analysis. Psychological Bulletin, 14, 747–768.

xxiv

Introduction: Psychological Science in Perspective

Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2, 196–217. Lehrer, J. (2010, December 13). The truth wears off. The New Yorker, 13, 52–57. Lilienfeld, S. O. (2010). Can psychology become a science? Personality and Individual Differences, 49, 281–288. Lindsay, D. S. (2015). Replication in psychological science. Psychological Science, 26, 1027–1032. Luborsky, L., Diguer, L., Seligman, D. A., Rosenthal, R., Krause, E. D., Johnson, S., … & Schweizer, E. (1999). The researcher’s own therapy allegiances: A “wild card” in comparisons of treatment efficacy. Clinical Psychology: Science and Practice, 6, 95–106. Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70, 151–159. MacCoun, R., & Perlmutter, S. (2015). Blind analysis: Hide results to seek the truth. Nature, 526, 187–189. McKay, D., & Lilienfeld, S. O. (2015). Is psychology really in shambles? A response to Jeffrey Lieberman. Society for a Science of Clinical Psychology. http://www.sscpweb.org/ Media‐Posts/3515419 Meehl, P. E. (1986). What social scientists don’t understand. In D. W. Fiske & R. A. Shweder (Eds.), Meta‐theory in social science (pp. 315–338). Chicago: University of Chicago Press. Mitchell, J. (2014). On the emptiness of failed replications. http://wjh.harvard.edu/~jmitchel/ writing/failed_science.htm. Nickerson, R. S. (1998). Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2, 175–220. Nuijten, M. B., Hartgerink, C. H., van Assen, M. A., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods. O’Donohue, W. T., Lilienfeld, S. O., & Fowler, K. A. (2007). Science is an essential safeguard against human error. In W. T. O’Donohue & S. O. Lilienfeld (Eds.), The great ideas of clinical science: 17 principles that every mental health professional should understand (pp. 3–27). New York, NY: Routledge. Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. Pashler, H., & Wagenmakers, E. J. (2012). Editors’ introduction to the special section on ­replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7, 528–530. Popper, K. R. (1959). The logic of scientific discovery. London, UK: Hutchinson. Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the future: Three unsuccessful attempts to replicate Bem’s “retroactive facilitation of recall”effect. PloS one, 7(3), e33423. Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638–641. Rossi, J. S. (1990). Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology, 58, 646–656. Schooler, J. (2011). Unpublished results hide the decline effect. Nature, 470, 437. Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309–316. Simons, D. J. (2014). The value of direct replication. Perspectives on Psychological Science, 9, 76‐80.




Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P‐curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666–681. Tajika, A., Ogawa, Y., Takeshima, N., Hayasaka, Y., & Furukawa, T. A. (2015). Replication and  contradiction of highly cited research papers in psychiatry: 10‐year follow‐up. The British Journal of Psychiatry, 207, 357–362. Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison‐Wesley. Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4, 274–290. Wagenmakers, E. J., Wetzels, R., Borsboom, D., & van der Maas, H. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100, 426–432. Waldman, I. D., & Lilienfeld, S. O. (2015). Thinking about data, research methods, and statistical analyses: Commentary on Sijtsma’s (2014) “Playing with Data.” Psychometrika, 81, 16–26. Walum, H., Waldman, I., & Young, L. (2015). Statistical and methodological considerations for the Interpretation of intranasal oxytocin studies. Biological Psychiatry. Advance online publication. doi:10.1016/j.biopsych.2015.06.016 Yong, E. (2012). Replication studies: bad copy. Nature, 485, 298–300. Zimbardo, P. G. (2004). Does psychology make a significant difference in our lives? American Psychologist, 59, 339–351.

Part I

Cross‐Cutting Challenges to Psychological Science

1

Maximizing the Reproducibility of Your Research
Open Science Collaboration1

Commentators in this book and elsewhere describe evidence that modal scientific practices in design, analysis, and reporting are interfering with the credibility and veracity of the published literature (Begley & Ellis, 2012; Ioannidis, 2005; Miguel et al., 2014; Simmons, Nelson, & Simonsohn, 2011; see Chapters 2 and 3). The reproducibility of published findings appears to be lower than many would expect or desire (Fuchs, Jenny, & Fiedler, 2012; Open Science Collaboration, 2015; Pashler & Wagenmakers, 2012). Further, common practices that interfere with reproducibility are maintained by incentive structures that prioritize innovation over accuracy (Nosek, Spies, & Motyl, 2012). Getting deeper into the metascience literature reviewing scientific practices might lead to a discouraging conclusion for the individual scientist – "I cannot change the system on my own, so what should I do?"

This chapter provides concrete suggestions for increasing the reproducibility of one's own research. We address reproducibility across the research lifecycle: project planning, project implementation, data analysis, reporting, and programmatic research strategies. We also attend to practical considerations for surviving and thriving in the present scientific culture, while simultaneously promoting a cultural shift toward transparency and reproducibility through the collective effort of independent scientists and teams. As such, practical suggestions to increase research credibility can be incorporated easily into the daily workflow without requiring substantial additional work in the short term, and perhaps saving substantial time in the long term. Further, journals, granting agencies, and professional organizations are adding recognition and incentives for reproducible science, such as badges for open practices (Kidwell et al., 2016) and the TOP Guidelines for journal and funder transparency policies (Nosek et al., 2015). Doing reproducible science will increasingly be seen as the way to advance one's career, and this chapter may provide a means to get a head start.


Project Planning Use high‐powered designs Within the nearly universal null hypothesis significance testing (NHST) framework, there are two inferential errors that can be made: (1) falsely rejecting the null hypothesis (i.e., believing that an effect exists, even though it does not), and (2) falsely failing to reject it when it is false (i.e., believing that no effect exists, even though it does). “Power” is the probability of rejecting the null hypothesis when it is false, given that an effect actually exists (see Chapters 3 and 4). Power depends on the size of the investigated effect, the alpha level, and the sample size.2 Low statistical power undermines the purpose of scientific research; it reduces the chance of detecting a true effect, but also, perhaps less intuitively, reduces the likelihood that a statistically significant result reflects a true effect (Ioannidis, 2005). The problem of low statistical power has been known for over 50 years: Cohen (1962) estimated that, in psychological research, the average power of studies to detect small and medium effects was 18% and 48%, respectively, a situation that had not improved almost 25 years later (Sedlmeier & Gigerenzer, 1989). More recently, Button and colleagues (Button et  al., 2013) showed that the median statistical power of studies in the n­eurosciences is between 8% and 31%. Considering that many of the problems of low power are well known and pernicious, it should be surprising that low‐power research is still the norm. Some reasons for the persistence of low‐powered studies are: (1) resources are limited, (2) researchers know that low power is a problem but do not appreciate its magnitude, and (3) there are insidious, perhaps unrecognized, incentives for engaging in low‐ powered research when publication of positive results is the primary objective. That is, it is easier to obtain false positive results with small samples, particularly by using one’s limited resources on many small studies rather than one large study (Bakker, van Dijk, & Wicherts, 2012; Button et  al., 2013; Ioannidis, 2005; Nosek et al., 2012). Given the importance of publication for academic success, these are formidable barriers. What can you do? To start with, consider the conceptual argument countering the publication incentive. If the goal is to produce accurate science, then adequate power is essential. When studying true effects, higher power increases the likelihood of detecting them. Further, the lure of publication is tempting, but the long‐term benefits are greater if the published findings are credible. Which would you rather have: more publications with uncertain accuracy, or fewer publications with more certain accuracy? Doing high‐powered research will take longer, but the rewards may last longer. Recruiting a larger sample is an obvious benefit, when feasible. There are also design strategies to increase power without more participants. For some studies, it  is feasible to apply within‐subject and repeated‐measurement designs. These approaches are more powerful than between‐subject and single‐measurement designs. Repeated‐measures designs allow participants to be their own controls,




reducing data variance. Also, experimental manipulations are powerful, as they minimize confounding influences. Further, reliable outcome measures reduce measurement error. For example, all else being equal, a study investigating hiring practices will have greater power if participants make decisions about many candidates compared to an elaborate scenario with a single dichotomous decision about one candidate. Finally, standardizing procedures and maximizing the fidelity of manipulation and measurement during data collection will increase power. A complementary approach for doing high‐powered research is collaboration. When a single research group cannot achieve the sample size required to provide sufficient statistical power, multiple groups can administer the same study materials, and then combine data. For example, the first “Many Labs” replication project administered the same study across 36 samples, totaling more than 6,000 participants, producing both extremely high‐powered tests of the effects and sufficient data to test for variability across samples and settings (Klein et al., 2014). Likewise, large‐scale collaborative consortia in fields such as human genetic epidemiology have transformed the reliability of findings in these fields (Austin, Hair, & Fullerton, 2012). Even just combining efforts across three or four labs can increase power d­ramatically while minimizing the labor and resource impact on any one contributor. Moreover, concerns about project leadership opportunities for publishing can be minimized with quid pro quo agreements – “you run my study, I’ll run yours.”
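To make these power considerations concrete, the sketch below uses Python's statsmodels package to solve for the sample size needed to detect a given standardized effect at 80% power, and to show how little power a typical small study has for a small effect. The specific effect sizes and targets are illustrative assumptions, not recommendations from this chapter, and the same calculation can be done in R or G*Power.

```python
# A minimal a priori power analysis for a two-group comparison,
# assuming an independent-samples t-test and Cohen's d as the effect size.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to detect a "medium" effect (d = 0.5)
# with 80% power at alpha = .05 (two-sided).
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"n per group for d = 0.5 at 80% power: {n_per_group:.0f}")  # ~64

# Power actually achieved by a small study (n = 20 per group)
# when the true effect is "small" (d = 0.2).
achieved = analysis.power(effect_size=0.2, nobs1=20, alpha=0.05)
print(f"Power with n = 20 per group and d = 0.2: {achieved:.2f}")  # roughly 0.1
```

Running such a calculation before data collection, and reporting it, is one concrete way to implement the recommendations above.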

Create an analysis plan Researchers have many decisions to make when conducting a study and analyzing data. Which data points should be excluded? Which conditions and outcome variables are critical to assess? Should covariates be included? What variables might moderate the key relationship? For example, Carp (2012a) found that, among 241 studies using functional magnetic resonance imaging (fMRI), there were 223 unique ­combinations of data cleaning and analysis procedures (e.g., correction for head motion, spatial smoothing, temporal filtering; see Chapter 11). The inordinate flexibility in analysis options provides researchers with substantial degrees‐of‐freedom to keep analyzing the data until a desired result is obtained; Carp (2012b) reports that, when using the over 30,000 possible combinations of analysis methods on a single neuroimaging experiment, 90.3% of brain voxels differed significantly b­etween conditions in at least one analysis. This flexibility could massively inflate false positives (Simmons et  al., 2011; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). The best defense against inflation of false positives is to reduce the degrees of freedom available to the researcher by writing down, prior to analyzing the data, how the data will be analyzed. This is the essence of confirmatory data analysis (Wagenmakers et al., 2012). The key effect of committing to an analysis plan in advance is the preservation of the meaning of the p‐values resulting from the analysis. The p‐value is supposed to indicate the likelihood that these data would have occurred if there was


no effect to detect. This interpretation is contingent on how many tests on the data were run and reported. Once the data have been observed, the universe of possible tests may be reduced to those that appear to be differences, and tests that do not reveal significant effects may be ignored. Without an a priori analysis plan, p‐values lose their meaning and the likelihood of false positives increases. Writing down an analysis plan in advance stimulates a more thorough consideration of potential moderators and controls, and also induces a deeper involvement with the previous research and formulated theories. By committing to a pre‐specified analysis plan, one can avoid common cognitive biases (Kunda, 1990; Nosek et  al., 2012). This approach also allows researchers to be open about and rewarded for their exploratory research (Wagenmakers et al., 2012), and highlights the value of conducting pilot research in order to clarify the qualities and commitments for a confirmatory design.
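One lightweight way to commit to an analysis plan before seeing the data is to write it down as a machine-readable, time-stamped file that is shared with collaborators or registered in advance. The sketch below is a hypothetical example of such a plan; the hypothesis, variable names, exclusion rules, and sample size are placeholders invented for illustration.

```python
# A hypothetical pre-specified analysis plan, written before data collection
# and saved alongside the project materials (e.g., in a registry such as the OSF).
import json
from datetime import date

analysis_plan = {
    "date_written": str(date.today()),
    "hypothesis": "Participants in the disgust-prime condition will report "
                  "harsher moral judgments than controls.",
    "primary_outcome": "mean_judgment_severity",   # placeholder variable name
    "exclusion_rules": [
        "exclude participants who fail the attention check",
        "exclude response times below 200 ms",
    ],
    "planned_test": "independent-samples t-test, two-sided, alpha = .05",
    "planned_n_per_condition": 100,
    "covariates": [],                              # none planned a priori
}

with open("analysis_plan.json", "w") as f:
    json.dump(analysis_plan, f, indent=2)
```

Because the plan is an ordinary file, it can be versioned, registered, and later compared line by line against the analyses that were actually reported.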

Project Implementation Determine data collection start and stop rules It is not uncommon for researchers to peek at their data and, when just shy of the “magical” alpha = 0.05 threshold for significance, to add participants to achieve significance (John, Loewenstein, & Prelec, 2012). This is problematic because it inflates the false positive rate (Simmons et al., 2011). Likewise, particularly with difficult‐to‐ collect samples (e.g., infant studies, clinical samples), there can be ambiguity during pilot testing about when a study is ready to begin. A few particularly “good” participants might be promoted to the actual data collection if the status of piloting versus actual data collection is not clear. Defining explicit data collection start and stop rules is effective self‐protection against false positive inflation. These could be defined as a target number of participants per condition, a target period of time for  data collection, as a function of an a priori power analysis, or by any other strategy that removes flexibility for deciding when data collection begins and ends (Meehl, 1990). Some journals, such as Psychological Science, now require disclosure of these rules.
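The cost of flexible stopping rules is easy to demonstrate by simulation. The sketch below assumes there is no true effect at all, repeatedly "peeks" at the data after each added batch of participants, and stops as soon as p < .05; the long-run rate of significant results climbs well above the nominal 5%. The batch sizes and maximum n are arbitrary choices for illustration.

```python
# Simulating how optional stopping inflates the false positive rate.
# Both groups are drawn from the same distribution, so every "effect" is spurious.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2017)
n_simulations = 2000
start_n, batch, max_n = 10, 5, 50   # peek after every 5 added participants per group

false_positives = 0
for _ in range(n_simulations):
    a = list(rng.normal(size=start_n))
    b = list(rng.normal(size=start_n))
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < 0.05:                 # stop and "publish" as soon as it is significant
            false_positives += 1
            break
        if len(a) >= max_n:          # give up at the maximum sample size
            break
        a.extend(rng.normal(size=batch))
        b.extend(rng.normal(size=batch))

print(f"False positive rate with peeking: {false_positives / n_simulations:.3f}")
# Typically well above the nominal .05 (often .12 or more with this many peeks).
```

Fixing the stopping rule in advance, or using a sequential design with appropriately corrected thresholds, removes this inflation.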

Register study and materials Many studies are conducted and never reported. This “file‐drawer effect” is a major challenge for the credibility of published results (Rosenthal, 1979; see Chapter  3). Considering only the likelihood of reporting a null result versus a positive result and ignoring the flexibility in analysis strategies, Greenwald (1975) estimated the false positive rate to be greater than 30%. However, it is difficult to imagine that every study conducted will earn a full write‐up and published report.




A more modest solution is to register every study at the onset of data collection in a public registry. Registration involves, at minimum, documentation of the study design, planned sample, and research objectives. Registration ensures that all c­onducted studies are discoverable, and facilitates the investigation of factors that  may differentiate the universe of studies conducted from the universe of studies published. Public registration of studies is required by law in the United States for clinical trials (De Angelis et al., 2004), and is a pre‐condition for publication in many major medical journals. The 2013 Declaration of Helsinki, a possible bellwether of ethical trends, recommends that this requirement be extended to all studies involving human participants (http://www.wma.net/en/30publications/10policies/b3). This movement toward more transparency of all research can improve accessibility of findings that were not published in order to evaluate potential biases in publishing and aggregate all evidence for a phenomenon. A common concern about public registration is that one’s ideas may be stolen by others before the research is completed and published. Registration actually certifies the originator of ideas with a time and date stamp. But, for the cautious researcher, some modern registries allow researchers to register studies privately and then reveal the registration later (e.g., https://osf.io, described later).

Data Analysis Perform confirmatory analyses first For confirmatory analyses to retain their interpretability, they must be conducted and reported in full. Consider, for example, pre‐registering 20 unique tests and reporting the single test that achieved a p‐value below 0.05. Selective reporting renders a confirmatory analysis plan irrelevant. Likewise, a confirmatory analysis plan does not eliminate interpretability challenges of multiple comparisons. So, disclosing all 20 registered tests does not make the one significant result less vulnerable to being a false positive. The key for registering an analysis plan is that it constrains the initial analyses conducted and makes clear that any potential Type 1 error inflation is limited to those confirmatory analyses. In the ideal confirmatory analysis, the analysis script is created in advance and executed upon completion of data collection. In some cases, this ideal will be difficult to achieve. For example, there may be honest mistakes in the pre‐­ registration phase or unforeseen properties of the data – such as non‐normal data distributions or lack of variation in a key variable – that make deviations from the original analysis plans necessary. Having an analysis plan – with whatever degree of specificity is possible – makes it easy to clarify deviations from a strictly confirmatory analysis; explanations of those deviations make it easier to judge their defensibility.
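The multiple-comparisons point can be made concrete with a familywise correction: if 20 confirmatory tests were registered, a single p < .05 result should be evaluated against the whole family. The sketch below applies a Holm correction with statsmodels to a hypothetical set of 20 p-values; the numbers are invented for illustration, and Holm is only one of several defensible correction methods.

```python
# Evaluating one "significant" result in the context of all 20 registered tests.
# The p-values here are invented for illustration.
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.03] + [0.20, 0.35, 0.41, 0.52, 0.63, 0.08, 0.71, 0.09, 0.55,
                              0.47, 0.88, 0.12, 0.66, 0.29, 0.91, 0.74, 0.38, 0.60, 0.83])

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")

print(f"Unadjusted:    {np.sum(p_values < 0.05)} of {len(p_values)} tests significant")
print(f"Holm-adjusted: {reject.sum()} of {len(p_values)} tests significant")
print(f"Smallest adjusted p-value: {p_adjusted.min():.2f}")   # 0.03 * 20 = 0.60
```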


Conduct exploratory analysis for discovery, not for hypothesis testing Exploratory analysis is a valuable part of data analysis (Tukey, 1977). Much of the progress made in science is through accidental discoveries of questions and hypotheses that one did not think to have in advance (Jaeger & Halliday, 1998). The emphasis on confirmatory designs does not discourage exploratory practice. Rather, it makes explicit the difference between outcomes resulting from confirmatory and exploratory approaches. In exploratory analysis, inductive reasoning is used to form tentative a posteriori hypotheses that explain the observations (Stebbins, 2001). Popper (1959) proposed that a hypothesis derived from a given set of observations cannot be falsified by those same observations. As Popper noted, “a hypothesis can only be empirically tested – and only after it has been advanced” (p. 7). Making explicit the distinction between confirmatory and exploratory analysis helps clarify the confidence in the observed effects, and emphasizes the fact that effects from exploratory analysis require additional investigation.

Test discoveries with confirmatory designs With discovery in hand, the temptation for publication is understandable. Replication offers only the dreaded possibility of “losing” the effect (Nosek et al., 2012). There may be no palliative care available other than to point out that, while many exploratory results are opportunities to develop hypotheses to be tested, they are not the hypothesis tests themselves. The long‐term view is that it is better to learn quickly that the effect is irreproducible than to expend resources on extensions that falsely assume its veracity. Following a discovery with a high‐powered confirmatory test is the single best way to enhance the credibility and reproducibility of research findings. This point is not universal. There are instances for which the effects in exploratory analysis are estimated with such precision that it is highly unlikely that they are chance findings. However, the circumstances required for this are uncommon in most psychological applications. The most common cases are studies with many thousands of participants. Even in these cases, it is possible to leverage chance and exaggerate results. Further, data collection circumstances may not be amenable to conducting a confirmatory test after an exploratory discovery. For example, with extremely hard‐to‐access samples, the effort required to conduct a confirmatory test may exceed available resources. It is simply a fact that clarifying the credibility of findings can occur more quickly for some research applications compared to others. For example, research on the development of infant cognition often requires laborious laboratory data collections with hard‐to‐reach samples. Adult personality investigations, on the other hand, can often be administered via the Internet to hundreds of people simultaneously. The former will necessarily accumulate information and knowledge more slowly than the latter.




These constraints do not exempt research areas from the tentativeness of exploratory results and the need for confirmatory investigations. Rather, because of practical constraints, some research applications may need to tolerate publication of more tentative results and slower progress in verification.
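The need for confirmation can also be illustrated by simulation: when an effect is first "discovered" in a small, noisy study, the estimated effect size among the studies that happened to reach significance is systematically inflated, so a confirmatory study powered on that estimate will often disappoint. The sketch below assumes a modest true effect (d = 0.3) studied with n = 20 per group; both numbers are arbitrary assumptions chosen for illustration.

```python
# Effect-size inflation among "significant" small studies (the winner's curse).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_d, n, n_studies = 0.3, 20, 5000

observed_d, significant = [], []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    observed_d.append((treatment.mean() - control.mean()) / pooled_sd)
    significant.append(stats.ttest_ind(treatment, control).pvalue < 0.05)

observed_d = np.array(observed_d)
significant = np.array(significant)
print(f"True effect: d = {true_d}")
print(f"Mean estimate across all studies:        {observed_d.mean():.2f}")
print(f"Mean estimate among significant studies: {observed_d[significant].mean():.2f}")
# The conditional estimate is typically more than double the true effect.
```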

Keep records of analyses Some challenges to reproducibility are more a function of deficient record‐keeping than analysis and reporting decisions. Analysis programs, such as SPSS, provide easy‐to‐use point‐and‐click interfaces for conducting analyses. The unfortunate result is that it can be very easy to forget the particulars of an analysis if only the output persists. A simple solution for increasing reproducibility is to retain scripts for exactly the analyses that were conducted and reported. Coupled with the data, re‐executing the scripts would reproduce the entire analysis output. This is straightforward with script‐based analysis programs such as R, STATA, and SAS, but is also easy with SPSS by simply generating and saving the scripts for the conducted analyses. Taking this simple step also offers practical benefits to researchers beyond improved reproducibility. When analysis procedures are carried out without the use of scripts, adding new data points, revising analyses, and answering methodological questions from reviewers can be both time‐consuming and error‐prone. Using scripts makes these tasks incomparably simpler and more accurate.
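A minimal habit that satisfies this recommendation is to keep the entire path from raw data to reported numbers in one script, and to record the software environment it ran in. The sketch below is a skeleton of such a script; the file names, variable names, and test are hypothetical placeholders, and the same pattern works equally well in R, Stata, SAS, or SPSS syntax.

```python
# analysis.py - a single script intended to reproduce the reported analysis end to end.
# File and variable names are hypothetical placeholders.
import sys
import pandas as pd
from scipy import stats

def main():
    data = pd.read_csv("raw_study1.csv")            # raw data, never edited by hand

    # Pre-specified exclusions, applied in code so they are documented and repeatable.
    data = data[data["attention_check"] == "pass"]

    treatment = data.loc[data["condition"] == "prime", "judgment"]
    control = data.loc[data["condition"] == "control", "judgment"]
    result = stats.ttest_ind(treatment, control)

    with open("study1_results.txt", "w") as f:
        f.write(f"Python {sys.version.split()[0]}, pandas {pd.__version__}\n")
        f.write(f"n = {len(data)} after exclusions\n")
        f.write(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}\n")

if __name__ == "__main__":
    main()
```

Re-running the script on the archived data should regenerate every number in the report.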

Share and review data and analysis scripts among collaborators Science is often done in teams. In most cases, team members have some specialization such as one member developing the research materials and another conducting analysis. In most cases, all members of a collaboration should have access to all of the research materials and data from a study. At minimum, shared access ensures that any member of the team could find the materials or data if other team members were not available. Moreover, sharing materials and data increases the likelihood of identifying and correcting errors in design and analysis prior to publication. For example, in software development, code review – the systematic evaluation of source code – is common practice to fix errors and improve the quality of the code for its purpose and reusability (Kemerer & Paulk, 2009; Kolawa & Huizinga, 2007). Such practices are easy to incorporate into scientific applications, particularly of analysis scripts, in order to increase confidence and accuracy in the reported analyses. Finally, sharing data and analysis scripts with collaborators increases the likelihood that both will be documented so that they are understandable. For the data analyst, it is tempting to forgo the time required to create a codebook and clear documentation of one’s analyses because, at the moment of analysis, the variable names and meaning of the analysis are readily available in memory. However, 6 months later, when the editor requires additional analysis, it can be hard to recall what VAR0001 and VAR0002


meant. Careful documentation of analyses and methods, along with data codebooks, increases reproducibility by making it easier for someone else, including your future self, to understand and interpret the data and analysis scripts (Nosek, 2014). Sharing with collaborators is a means of motivating this solid practice, which might otherwise feel dispensable in the short term but becomes a substantial time saver in the long term.
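A codebook need not be elaborate; even a small machine-readable table mapping opaque variable names to human-readable descriptions prevents the "what was VAR0001?" problem. The sketch below writes such a codebook with pandas; the variables, labels, and missing-data codes are hypothetical examples.

```python
# Writing a simple codebook to accompany a shared dataset.
# Variable names, labels, and codes are hypothetical examples.
import pandas as pd

codebook = pd.DataFrame([
    {"variable": "VAR0001", "label": "Condition (1 = disgust prime, 2 = control)",
     "type": "categorical", "missing_code": "-99"},
    {"variable": "VAR0002", "label": "Moral judgment severity (1-7 Likert scale)",
     "type": "integer", "missing_code": "-99"},
    {"variable": "VAR0003", "label": "Participant age in years",
     "type": "integer", "missing_code": "-99"},
])

codebook.to_csv("codebook_study1.csv", index=False)
print(codebook.to_string(index=False))
```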

Archive materials and data Wicherts, Borsboom, Kats, and Molenaar (2006) tried to obtain original datasets from 249 studies in order to reproduce the reported results. They found that the major barrier to reproducibility was not errors in the datasets; it was not being able to access the datasets at all. Just 26% of the datasets were available for reanalysis. In a more recent case, Vines and colleagues (2013) found that just 23% of 516 requested datasets were available, and that the availability of datasets declined by 7% per year over the 20‐year period they studied. Further, Vines and colleagues observed that the working rate of email addresses of corresponding authors fell by 7% per year over the same span. In sum, reproducibility and reanalysis of data are most threatened by the gradual loss of information through the regular laboratory events of broken machines, rotating staff, and mismanagement of files. The potential damage to one's own research and data management is substantial. Researchers routinely return to study designs or datasets as their research programs mature. If those materials and data are not well maintained, there is substantial loss of time and resources in trying to recover prior work. Considering the substantial resources invested in obtaining the data and conducting the research, these studies reveal a staggering degree of waste of important scientific resources. It does not need to be this way. There are now hundreds of repositories available for archiving and maintaining research materials and data. If researchers adopt the strategy of sharing research materials and data among collaborators, then it is a simple step to archive those materials for purposes of preservation and later recovery.
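Beyond choosing a repository, it helps to deposit a manifest that lets anyone, including your future self, verify that the archived files are complete and unaltered. The sketch below builds such a manifest with SHA-256 checksums using only the Python standard library; "project_materials" is a placeholder for your own materials and data folder.

```python
# Building a checksum manifest for a project directory before archiving it.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

project = Path("project_materials")   # placeholder folder name
manifest = {
    str(p.relative_to(project)): sha256_of(p)
    for p in sorted(project.rglob("*")) if p.is_file()
}

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)

print(f"Recorded checksums for {len(manifest)} files.")
```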

Reporting Disclose details of methods and analysis The Publication Manual of the American Psychological Association, Sixth Edition (2010), a popular style guide for report writing, suggests that methods sections need to report sufficient detail so that a reader can reasonably replicate the study (p. 29). And, for reporting analyses, authors should “mention all relevant results, including those that run counter to expectation” (p. 32), and “include sufficient information to help the reader fully understand the analyses conducted,” minimally including “the per‐cell sample size, the observed cell means (or frequencies of cases in each category for a categorical variable), and the cell standard deviations, or the pooled within‐cell variance” (p. 33).




Even a cursory review of published articles reveals that these norms are rarely met in modal research practices (see Chapter 5). And yet, complete methodology and analysis description is vital for reproducibility. In the ideal report, a reader should be able to identify the conditions necessary to conduct a fair replication of the original research design, and have sufficient description of the analyses to reproduce them on the same or a new dataset. Without full description, the replication will inevitably contain many unintended differences from the original design or analysis that could interfere with reproducibility. There are occasions in which some critical elements of a research design will not fit into a written report – either because of length restrictions or because the design elements cannot be described in words effectively. For both, there are readily available alternatives. Supplementary materials, which most journals now support online during review and after publication, allow more comprehensive descriptions of methodology. Photo or video simulations of research designs can clarify key elements that are not easy to describe. What should be included in methods descriptions will vary substantially across research applications. An example of guidelines for effective reporting of methods and results was developed by the Research Committee at the Tilburg University Social Psychology Department (2013; see http://www.academia.edu/2233260/Manual_for_Data_Sharing_‐_ Tilburg_University). Whereas preparing comprehensive descriptions of research methods may add to the time required to publish a paper, it also has the potential to increase the impact of the research. Independent scientists interested in replicating or extending the published findings may be more likely to do so if the original report describes the methods thoroughly. And detailed methodological reporting increases the chances that subsequent replication attempts will faithfully adhere to the original methods, increasing the odds that the findings are replicated and the original authors’ r­eputations enhanced.

Follow checklists for good reporting practices The APA manual provides specific guidance for style and general guidance for the content of reporting. Following revelations of the substantial opportunity for flexibility in data analysis and reporting (Simmons et al., 2011), and of its exploitation (John et al., 2012), new norms are emerging for standard disclosure checklists describing the research process. Following Simmons et al. (2011) and LeBel and colleagues (2013), Psychological Science has established four items that must be disclosed in all of its articles (Eich, 2013): (1) how sample sizes were determined, (2) how many observations, if any, were excluded, (3) all experimental conditions that were tested, including failed manipulations, and (4) all items and measurements that were administered (see Chapter 5). These disclosures are easy to implement for any report, regardless of journal, and they cover the key points at which researchers may take advantage of, or avoid, leveraging chance in producing research findings.


More generally, checklists can be an effective way of making sure desired behaviors are performed (Gawande, 2009). There are a variety of checklists emerging for particular research practices and reporting standards. For example: (1) CONSORT is a checklist and reporting standard for clinical trials (Moher et al., 2010); (2) the ARRIVE checklist has a similar purpose for animal research (Kilkenny, Browne, Cuthill, Emerson, & Altman, 2010); (3) Kashy and colleagues (Kashy, Donnellan, Ackerman, & Russell, 2009) provided recommendations for methods and results reporting for authors of articles in Personality and Social Psychology Bulletin that have wider applicability; (4) Poldrack and colleagues (2008) offered a reporting standards checklist for fMRI analysis pipelines (see Chapter  11); (5) Klein and c­olleagues (2012) suggested standard reporting of participant and experimenter characteristics for behavioral research; (6) Brandt and colleagues (2014) offered 36 questions to address for conducting effective replications; (7) members of a laboratory and course at the University of Virginia generated three brief checklists for managing research workflow, implementing a study, and reporting the results to facilitate transparency in research practices (Open Science Collaboration, 2012c); and (8) the headings of this chapter can serve as a checklist for reproducibility p­ractices, as presented in Table 1.1.

Table 1.1  Increasing the reproducibility of psychological research across the research lifecycle.

Project Planning
1. Use high‐powered designs
2. Create an analysis plan

Project Implementation
3. Determine data collection start and stop rules
4. Register study and materials

Data Analysis
5. Perform confirmatory analyses first
6. Conduct exploratory analysis for discovery, not for hypothesis testing
7. Test discoveries with confirmatory designs
8. Keep records of analyses
9. Share and review data and analysis scripts among collaborators
10. Archive materials and data

Reporting
11. Disclose details of methods and analysis
12. Follow checklists for good reporting practices
13. Share materials and data with the scientific community
14. Report results to facilitate meta‐analysis

Programmatic Strategies
15. Replicate‐and‐extend
16. Participate in crowdsourced research projects
17. Request disclosure as a peer reviewer




Share materials and data with the scientific community When Wicherts et al. (2006) received just 26% of the requested datasets of published articles, they speculated that the low response rate was primarily a function of the time and effort it takes for researchers to find, prepare, and share their data and codebooks after publication. It is also possible that some were reluctant to share because the present culture perceives such requests as non‐normative, and perhaps as an effort to discredit one's research. Explicit, widespread embrace of openness as a value for science may help neutralize this concern. More directly to the point, when materials and data are archived from the start of the research process, it is much easier for researchers to adhere to data‐sharing requests. Some archiving solutions make it trivially easy to move a private repository into public or controlled access. Researchers who shared their materials and data with collaborators in a web‐based archive can select which of those materials and data to release to the public. This may be particularly helpful for addressing the file‐drawer effect: for studies that researchers do not intend to write up and publish, their presence in a registry and public access to the materials and data ensure their discoverability for meta‐analysis and assist researchers investigating similar questions in designing their own studies.

Sharing research materials and data is not without concerns. First, researchers may be concerned about the amount of work that will be required of them once method and data sharing becomes the standard. However, if researchers incorporate the expectation of sharing materials and data with collaborators, and potentially more publicly, into their daily workflow, sharing becomes surprisingly easy and encourages good documentation practices that assist the researcher's own access to the materials and data in the future. This may even save time and effort in the long run. Second, some data collections require extraordinary effort to collect and are the basis for multiple publications. In such cases, researchers may worry about the cost–benefit ratio of the effort expended to obtain the data against the possibility of others using the data before they have had sufficient time to develop their own published research from it. There are multiple ways to address this issue, including: (a) releasing the data in steps, exposing only the variables necessary to reproduce published findings; (b) establishing an embargo period during which the original authors pursue analysis and publication, and then opening the data to others; or (c) embracing the emerging evidence that open data lead to greater scientific output and impact (Piwowar & Vision, 2013). Further, there are journals such as the Journal of Open Psychology Data (http://openpsychologydata.metajnl.com) and organizational efforts such as Datacite (http://www.datacite.org) that make datasets themselves citable and a basis for earning reputation and citation impact.

Finally, the most potent concern is protecting participant privacy in human participant research. At all times, the individual researcher bears fundamental responsibility for meeting this ethical standard; data sharing cannot compromise participants' rights and well‐being. For many research applications, making the data


anonymous is relatively easy to do by removing specific variables that are not essential for reproducing published analyses. For other research applications, a p­ermissions process may be needed to obtain datasets with sensitive information. In summary, reproducibility will be maximized if the default practice for materials and data is to share them openly. Restrictions on open data are then the exceptions to the default practice. There are many defensible reasons for closing access, particularly to data. Those reasons should be made explicit in each use case.

Report results to facilitate meta‐analysis A single study rarely settles a scientific question. Any single finding could be upwardly or downwardly biased (i.e., larger or smaller than the true effect, respectively) due to random or systematic sources of variance. Meta‐analysis addresses this c­oncern by allowing researchers to model such variance and thereby provides s­ummary estimates worthy of increased confidence. However, if the sources used as input to meta‐analyses are biased, the resulting meta‐analytic estimates will also be biased. Biased meta‐analytic findings are especially problematic because they are more likely than primary studies to reach scientific and practitioner audiences. Therefore, they affect future research agendas and evidence‐based practice (Kepes & McDaniel, 2013). Individual researchers can facilitate effective aggregation of research evidence by (a) making their own research evidence – published and unpublished – available for discovery by meta‐analysts, and (b) structuring the results reports so that the required findings are easy to find and aggregate. The first is addressed by following the archiving and sharing steps described previously. The second is facilitated by ensuring that effect sizes for effects of interest and all variable pairs are available in the report or supplements. For example, authors can report a correlation matrix, which serves as an effect size repository for a variety of variable types (Dalton, Aguinis, Dalton, Bosco, & Pierce, 2012).
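In practice, reporting results in a form meta-analysts can use mostly means providing effect sizes (with their uncertainty) and the full correlation matrix, not just test statistics for the focal contrast. The sketch below computes Cohen's d with an approximate 95% confidence interval and a complete correlation matrix for a simulated dataset; the variable names and values are invented stand-ins for a real study's variables.

```python
# Reporting an effect size and a full correlation matrix to support later meta-analysis.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
data = pd.DataFrame({
    "condition": np.repeat([0, 1], 100),
    "judgment": np.concatenate([rng.normal(4.0, 1.2, 100), rng.normal(4.5, 1.2, 100)]),
    "age": rng.integers(18, 60, 200),
    "mood": rng.normal(0, 1, 200),
})

g0 = data.loc[data.condition == 0, "judgment"]
g1 = data.loc[data.condition == 1, "judgment"]
pooled_sd = np.sqrt(((len(g0) - 1) * g0.var(ddof=1) + (len(g1) - 1) * g1.var(ddof=1))
                    / (len(g0) + len(g1) - 2))
d = (g1.mean() - g0.mean()) / pooled_sd
se_d = np.sqrt((len(g0) + len(g1)) / (len(g0) * len(g1)) + d**2 / (2 * (len(g0) + len(g1))))
print(f"Cohen's d = {d:.2f}, 95% CI [{d - 1.96 * se_d:.2f}, {d + 1.96 * se_d:.2f}]")

# A full correlation matrix gives meta-analysts effect sizes for every variable pair.
print(data[["judgment", "age", "mood"]].corr().round(2))
```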

Programmatic Strategies Replicate‐and‐extend The number of articles in psychology explicitly dedicated to independent, direct replications of research appears to be 1% or less of published articles (Makel, Plucker, & Hegarty, 2012). It would be easy to conclude from this that psychologists do not care about replicating research, and that journals reject replication studies routinely because they do not make a novel enough contribution. However, even when researchers are skeptical of the value of publishing replications, they may agree that replication‐and‐extension is a profitable way to meet journals’ standards for innovation while simultaneously increasing confidence in existing findings.




A great deal of replication could be carried out in the context of replicate‐and‐ extend paradigms (Nosek et  al., 2012; Roediger, 2012). Researchers may repeat a procedure from an initial study within the same paper, adding conditions or m­easures, but also preserving the original design. For example, a Study 2 might include two conditions that replicate Study 1 (disgust prime and control), but also add a third condition (anger prime), and a second outcome measure. Thus, Study 2 offers a direct replication of the Study 1 finding, with an extension comparing those original conditions to an anger prime condition. This provides greater certainty about the reproducibility of the original result than a Study 2 that tests the same h­ypothesis after changing all the operationalizations.
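Analytically, a replicate-and-extend design like the hypothetical Study 2 above reduces to two planned comparisons: the replication contrast (disgust vs. control) and the extension contrast (anger vs. control). The sketch below shows the skeleton of that analysis on simulated data; the condition names follow the illustrative example in the text, while the sample sizes and effect magnitudes are invented assumptions.

```python
# Planned comparisons for a replicate-and-extend design (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 80
control = rng.normal(0.0, 1.0, n)
disgust = rng.normal(0.4, 1.0, n)   # assumed effect for the direct replication
anger = rng.normal(0.2, 1.0, n)     # assumed effect for the extension

# Replication contrast: does the original disgust vs. control effect reappear?
rep = stats.ttest_ind(disgust, control)
# Extension contrast: the new anger condition against the same control group.
ext = stats.ttest_ind(anger, control)

print(f"Replication (disgust vs. control): t = {rep.statistic:.2f}, p = {rep.pvalue:.3f}")
print(f"Extension (anger vs. control):     t = {ext.statistic:.2f}, p = {ext.pvalue:.3f}")
```

In a full report, both contrasts would be accompanied by effect sizes and, where appropriate, a correction for the two planned comparisons.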

Participate in crowdsourced research projects The prior section alluded to the fact that some challenges for reproducibility are a function of the existing culture strongly prioritizing innovation over verification (Nosek et  al., 2012). It is not worth researchers’ time to conduct replications or c­onfirmatory tests if they are not rewarded for doing so. Similarly, some problems are not theoretically exciting, but would be practically useful for developing standards or best practices for reproducible methodologies. For example, the scrambled sentence paradigm is used frequently to make particular thoughts accessible that may influence subsequent judgment (e.g., Bargh, Chen, & Burrows, 1996). Despite being a frequently used paradigm, there is no direct evidence for which procedural features optimize the paradigm’s effectiveness, and there is great variation in operationalizations across studies. Optimizing the design would be very useful for maximizing power and reproducibility, but conducting the required studies would be time consuming with uncertain reward. Finally, some problems are acknowledged to be important, but are too large to tackle singly. It is difficult for individual researchers to prioritize doing any of these when confronted with the competitive nature of getting a job, keeping a job, and succeeding as an academic scientist. One solution for managing these incentive problems is crowdsourcing. Many researchers can each contribute a small amount of work to a larger effort. The accumulated contribution is large, and little risk is taken on by any one contributor. For example, the Reproducibility Project: Psychology investigated the predictors of reproducibility of psychological science by replicating a large sample of published findings. More than 350 researchers (270 earning co‐authorship) worked together with many small teams, each conducting a replication following a standardized p­rotocol (Open Science Collaboration, 2012a, 2012b, 2015). Another approach is to incorporate replications into teaching. This can address the incentives problem and provide pedagogical value simultaneously (Frank & Saxe, 2012; Grahe et  al., 2012). The CREP project (https://osf.io/wfc6u; Grahe, Brandt, IJzerman, & Cohoon, 2014) identifies published research for which r­eplication could be feasibly incorporated into undergraduate methods courses. Also, the Archival Project (http://archivalproject.org) integrates crowdsourcing and


pedagogical value with a crowdsourced effort to code articles to identify the rates of replications and characteristics of methods and results in the published literature.

Request disclosure as a peer reviewer Individual researchers can contribute to promoting a culture of reproducibility by adapting their own research practices, and also by asking others to do so in the context of their roles as peer reviewers. Peer reviewers have influence on the articles they review and, in the aggregate, on editors and standard journal practices. The Center for Open Science (http://cos.io) maintains a standard request, which peer reviewers can include in their reviews of empirical research to promote a culture of transparency:

    "I request that the authors add a statement to the paper confirming whether, for all experiments, they have reported all measures, conditions, data exclusions, and how they determined their sample sizes. The authors should, of course, add any additional text to ensure the statement is accurate. This is the standard reviewer disclosure request endorsed by the Center for Open Science [see also http://osf.io/hadz3]. I include it in every review."

Including this as a standard request in all reviews can (a) show the broad interest in making the disclosure a standard practice, and (b) emphasize it as a cultural norm and not an accusatory stance toward any individual. A culture of transparency works best if all members of the culture are expected to abide by it.

Implementing These Practices: An Illustration with the Open Science Framework There are a variety of idiosyncratic ways to implement the practices discussed in this chapter. Here, we offer an illustration using an open‐source web application that is maintained by the Center for Open Science, called the Open Science Framework (OSF; http://osf.io). All of the practices summarized here can be supported by the OSF (see also Chapter 5). Organize a  research project. The research workflow in the OSF begins with the creation of a project. The creator provides the title and description, uploads files, writes documentation via the wiki, and adds contributors. Users can create project components to organize the project into conceptual units. For example, a survey research project might include one component for study design and sampling procedures, another for survey instruments, a third for raw data, a fourth for data analysis, and a fifth for the published report. Each component has its own list of contributors and privacy settings. For example, the lead investigators of a project




may decide to grant access to the data‐coding components to research assistant collaborators, but deny those collaborators permission to modify the data analysis components.

Create an analysis plan. Once the investigator has organized the project and added contributors, he or she might then add the analysis plan. The investigator might create a new component for the analysis plan, upload analysis scripts and sample codebooks, and write a narrative summary of the plan in the component wiki.

Register study and materials. Once the investigator is ready to begin data collection, he or she might next register the study and materials. Materials are often reused between studies and may evolve; registration at this point ensures that the exact materials used in the study are preserved. To do so, the investigator would click a button to initiate a registration and provide some description of what is being registered. Once created, this registration becomes a frozen copy of the project as it existed at the moment it was registered. This frozen copy is linked to the project, which the researchers may continue to edit. Thus, by creating a registration, the investigator can later demonstrate that his or her published analysis matched the original plan – or, if any changes were necessary, detail what was changed and why.

Keep records of analyses. As the research team collects data and conducts analyses, the tools used to generate the analyses and records of how those tools were used can be added to the data analysis component of the project. These might include analysis or data‐cleaning scripts written using Python, R, or SPSS; quality‐checking procedures; or instructions for running these scripts on new data. The OSF records all changes made to project components, so the research team can easily keep track of what changed, when it changed, and who changed it. Prior versions are retained and recoverable.

Share materials and data. At any point during the research lifecycle, the team may choose to make some or all of their work open to the public. OSF users can make a project or one of its components public by clicking the "Make Public" button on the dashboard of each project. Researchers can also independently control the privacy of each component in a project; for example, an investigator may decide to make his or her surveys and analysis plan public, but keep the raw data private to protect the identities of the research participants.

Replicate and extend. Once the investigator's project is complete, independent scientists may wish to replicate and extend his or her work. If the original investigator made some or all of the work public, other OSF users can create an independent copy (or "fork") of the project as a starting point for their own investigations. For example, another OSF user might fork the original researcher's data collection component to use the surveys in a new study. Similarly, a researcher planning a meta‐analysis might fork the raw data or data analysis components of several OSF projects to synthesize the results across studies. The source project or component is maintained, creating a functional citation network – the original contributors' credit is forever maintained.
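The component structure described above can also be mirrored in a local project folder before, or alongside, uploading to the OSF, so that every component mentioned in the example has an obvious home. The sketch below only creates directories and a short README using the Python standard library; it does not call the OSF's own interface, and the folder names simply follow the survey-project example above.

```python
# Scaffolding a local project that mirrors the OSF component structure described above.
from pathlib import Path

components = [
    "01_design_and_sampling",
    "02_survey_instruments",
    "03_raw_data",
    "04_data_analysis",
    "05_published_report",
]

root = Path("my_survey_project")   # placeholder project name
for name in components:
    (root / name).mkdir(parents=True, exist_ok=True)

(root / "README.txt").write_text(
    "Each folder corresponds to one OSF component; upload or register them as units "
    "so that contributors and privacy settings can differ per component.\n"
)
print(f"Created {len(components)} component folders under {root}/")
```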


Conclusion We started this chapter, concerning how to improve reproducibility, with a question: “What can I do?” We intend the suggestions made in this chapter to provide p­ractical answers to that question. When researchers pursue open, reproducible practices, they are actively contributing to enhancing the reproducibility of psychological research, and to establishing a culture of “getting it right” (Nosek et al., 2012; see Chapter 2). Though adhering to these suggestions may require some adaptation of current practices by the individual researcher, we believe that the steps are minor, and that the benefits will far outweigh the costs. Good practices may be rewarded with general recognition, badges (https://osf.io/tvyxz; Kidwell et  al., 2016), and enhanced reputation, but ultimately the reward will be the satisfaction of having contributed to a cumulative science via reproducible findings.

Endnotes
1  Alexander A. Aarts, Frank A. Bosco, Katherine S. Button, Joshua Carp, Susann Fiedler, James G. Field, Roger Giner‐Sorolla, Hans IJzerman, Melissa Lewis, Marcus Munafò, Brian A. Nosek, Jason M. Prenoveau, and Jeffrey R. Spies.
2  Even outside of the dominant NHST model, the basic concept of higher power still holds in a straightforward way – increase the precision of effect estimates with larger samples and more sensitive and reliable methods.

References American Psychological Association (APA). (2010). Publication Manual of the American Psychological Association, Sixth Edition. Washington, DC: Author. Austin, M. A., Hair, M. S., & Fullerton, S. M. (2012). Research guidelines in the era of large‐ scale collaborations: An analysis of Genome‐wide Association Study Consortia. American Journal of Epidemiology, 175, 962–969. doi: 10.1093/aje/kwr441 Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543–554. doi: 10.1177/1745691612459060 Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior: Direct effects of trait construct and stereotype activation on action. Journal of Personality and Social Psychology, 71, 230–244. doi:10.1037/0022‐3514.71.2.230 Begley, C. G., & Ellis, L. M. (2012). Raise standards for preclinical cancer research. Nature, 483, 531–533. doi: 10.1038/483531a Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F., Geller, J., Giner‐Sorolla, R., … Van’t Veer, A. (2014). The replication recipe: What makes for a convincing replication? Journal of Experimental Social Psychology, 50, 217–224. doi: 10.1016/j. jesp.2013.10.005 Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafo, M. R. (2013). Power failure: Why small sample size undermines the reliability of n­euroscience. Nature Reviews Neuroscience, 14(5), 365–376. doi: 10.1038/nrn3475




Carp, J. (2012a). The secret lives of experiments: Methods reporting in the fMRI literature. NeuroImage, 63(1), 289–300. doi:10.1016/j.neuroimage.2012.07.004 Carp, J. (2012b). On the plurality of (methodological) worlds: Estimating the analytic f­ lexibility of fMRI experiments. Frontiers in Neuroscience, 6, 149. doi: 10.3389/ fnins.2012.00149 Cohen, J. (1962). The statistical power of abnormal‐social psychological research: A review. Journal of Abnormal Social Psychology, 65, 145–153. doi: 10.1037/h0045186 Dalton, D. R., Aguinis, H., Dalton, C. M., Bosco, F. A., & Pierce, C. A. (2012). Revisiting the file drawer problem in meta‐analysis: An assessment of published and nonpublished correlation matrices. Personnel Psychology, 65, 221–249. doi: 10.1111/j.1744‐6570.2012.01243.x De Angelis, C., Drazen, J. M., Frizelle, F. A., Haug, C., Hoey, J., Horton, R., … Van der Weyden, M. B. (2004). Clinical trial registration: A statement from the international committee of medical journal editors. Lancet, 364, 911–912. Eich, E. (2013). Business not as usual. Psychological Science. Advance online publication. doi: 10.1177/0956797613512465 Frank, M. C., & Saxe, R. (2012). Teaching replication. Perspectives on Psychological Science, 7, 600–604. doi: 10.1177/1745691612460686 Fuchs, H. M., Jenny, M. A., & Fiedler, S. (2012). Psychologists are open to change, yet wary of rules. Perspectives on Psychological Science, 7, 639–642. doi:10.1177/1745691612459521 Gawande, A. (2009). The checklist manifesto. New York, NY: Metropolitan Books. Grahe, J., Brandt, M. J., IJzerman, H., & Cohoon, J. (2014). Collaborative Replications and Education Project (CREP). Retrieved from Open Science Framework, http://osf.io/wfc6u Grahe, J. E., Reifman, A., Hermann, A. D., Walker, M., Oleson, K. C., Nario‐Redmond, M., & Wiebe, R. P. (2012). Harnessing the undiscovered resource of student research projects. Perspectives on Psychological Science, 7, 600–604. doi: 10.1177/1745691612459057 Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1–20. doi: 10.1037/h0076157 Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124. doi: 10.1371/journal.pmed.0020124 Jaeger, R. G., & Halliday, T. R. (1998). On confirmatory versus exploratory research. Herpetologica, 54(Suppl), 564–566. John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532. doi: 10.1177/0956797611430953 Kashy, D. A., Donnellan, M. B., Ackerman, R. A., & Russell, D. W. (2009). Reporting and interpreting research in PSPB: Practices, principles, and pragmatics. Personality and Social Psychology Bulletin, 35(9), 1131–1142. doi: 10.1177/0146167208331253 Kemerer, C. F., & Paulk, M. C. (2009). The impact of design and code reviews on software quality: An empirical study based on PSP data. IEEE Transactions on Software Engineering, 35, 534–550. doi: 10.1109/TSE.2009.27 Kepes, S., & McDaniel, M. A. (2013). How trustworthy is the scientific literature in I‐O p­sychology? Industrial and Organizational Psychology: Perspectives on Science and Practice, 6, 252–268. doi: 10.1111/iops.12045 Kidwell, M. C., Lazarevic, L. B., Baranski, E., Hardwicke, T. E., Piechowski, S., Falkenberg, L. ‐S., Kennett, C., Slowik, A., Sonnleitner, C., Hess‐Holden, C., Errington, T. M., Fiedler, S., & Nosek, B. A. (2016). 
Badges to acknowledge open practices: A simple, low‐cost, effective method for increasing transparency. PLOS Biology, 14, e1002456.


Kilkenny, C., Browne, W. J., Cuthill, I. C., Emerson, M., & Altman, D. G. (2010). Improving bioscience research reporting: The ARRIVE guidelines for reporting animal research. PLoS Biology, 8(6), e1000412. doi 10.1371/journal.pbio.1000412 Klein, O., Doyen, S., Leys, C., Magalhäes de Saldanha da Gama, P. A., Miller, S., Questienne, L., & Cleeremans, A. (2012). Low hopes, high expectations: Expectancy effects and the replicability of behavioral experiments. Perspectives on Psychological Science, 7(6), 572–584. doi: 10.1177/1745691612463704 Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Bahník, S., Bernstein, M. J., … Nosek, B. A. (2014). Investigating variation in replicability: A “many labs” replication project. Social Psychology, 45, 142–152. Kolawa, A., & Huizinga, D. (2007). Automated defect prevention: Best practices in software management. New York, NY: Wiley‐IEEE Computer Society Press. Kunda, Z. (1990). The case for motivated reasoning. Psychological Bulletin, 108, 480–498. doi: 10.1037/0033‐2909.108.3.480 LeBel, E. P., Borsboom, D., Giner‐Sorolla, R., Hasselman, F., Peters, K. R., Ratliff, K. A., & Smith, C. T. (2013). PsychDisclosure.org: Grassroots support for reforming reporting standards in psychology. Perspectives on Psychological Science, 8(4), 424–432. doi: 10.1177/1745691613491437 Makel, M. C., Plucker, J. A., & Hegarty, B. (2012). Replications in psychology research: How often do they really occur? Perspectives on Psychological Science, 7, 537–542. doi: 10.1177/1745691612460688 Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195–244. Miguel, E., Camerer, C., Casey, K., Cohen, J., Esterling, K. M., Gerber, A., … Van der Laan, M. (2014). Promoting transparency in social science research. Science, 343, 30–31. doi: 10.1126/science.1245317 Moher, D., Hopewell, S., Schulz, K. F., Montori, V., Gøtsche P. C., Devereaux, P. J., … Altman, D. G. (2010). CONSORT 2010 explanation and elaboration: Updated guidelines for reporting parallel group randomized trials. Journal of Clinical Epidemiology, 63(8), e1‐37. doi: 10.1016/j.jclinepi.2010.03.004 Nosek, B. A. (2014). Improving my lab, my science with the open science framework. APS Observer, 27(3). Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., Buck, S., Chambers, C. D., Chin, G., Christensen, G., Contestabile, M., Dafoe, A., Eich, E., Freese,  J., Glennerster, R., Goroff, D., Green, D. P., Hesse, B., Humphreys, M., Ishiyama, J., Karlan, D., Kraut, A., Lupia, A., Mabry, P., Madon, T. A., Malhotra, N., Mayo‐Wilson, E., McNutt, M., Miguel, E., Levy Paluck, E., Simonsohn, U., Soderberg, C., Spellman, B. A., Turitto, J., VandenBos, G., Vazire, S., Wagenmakers, E. J., Wilson, R., & Yarkoni, T. (2015). Promoting an open research culture. Science, 348, 1422–1425. Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific Utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 615–631. doi: 10.1177/1745691612459058 Open Science Collaboration. (2012a). An open, large‐scale, collaborative effort to estimate the reproducibility of psychological science. Perspectives on Psychological Science, 7, 657–660. doi: 10.1177/1745691612462588




Open Science Collaboration. (2012b). The reproducibility project: A model of large-scale collaboration for empirical research on reproducibility. In V. Stodden, F. Leisch, & R. Peng (Eds.), Implementing Reproducible Computational Research (A Volume in The R Series) (pp. 299–323).  New York, NY: Taylor & Francis. Open Science Collaboration. (2012c). Checklists for Research Workflow. Retrieved from osf. io/mv8pj. Open Science Collaboration. (2014). The reproducibility project: A model of large‐scale collaboration for empirical research on reproducibility. In V. Stodden, F. Leish, & R. Peng (Eds.), Implementing Reproducible Computational Research (A Volume in The R Series) (pp. 299–323). New York, NY: Taylor & Francis. Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. Pashler, H., & Wagenmakers, E. ‐J. (2012). Editors’ introduction to the special section on r­eplicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7, 528–530. doi: 10.1177/1745691612465253 Piwowar, H. A., & Vision, T. J. (2013). Data reuse and the open data citation advantage. PeerJ, 1, e175. doi: 10.7717/peerj.175 Poldrack, R. A., Fletcher, P. C., Henson, R. N., Worsley, K. J., Brett, M., & Nichols, T. E. (2008). Guidelines for reporting an fMRI study. NeuroImage, 40(2), 409–414. doi: 10.1016/j. neuroimage.2007.11.048 Popper, K. R. (1959). The logic of scientific discovery. London: Hutchinson. Roediger, H. L. (2012). Psychology’s woes and a partial cure: The value of replication. APS Observer, 25, 27–29. Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638–641. doi:10.1037/0033‐2909.86.3.638 Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105(2), 309–316. doi: 10.1037/0033‐2909.105.2.309 Simmons, J., Nelson, L., & Simonsohn, U. (2011). False‐positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as “significant.” Psychological Science, 22, 1359–1366. doi: 10.1177/0956797611417632 Stebbins, R. A. (2001). Exploratory research in the social sciences. Thousand Oaks, CA: Sage. doi: 10.4135/978141298424 Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison‐Wesley. Vines, T. H., Albert, A. Y. K., Andrew, R. L., Débarre, F., Bock, D. G., Franklin, M. T., … Rennison, D. J. (2013). The availability of research data declines rapidly with article age. Current Biology, 24(1), 94–97. doi: 10.1016/j.cub.2013.11.014 Wagenmakers, E. ‐J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632–638. doi: 10.1177/1745691612463078 Wicherts, J. M., Borsboom, D., Kats, J., & Molenaar, D. (2006). The poor availability of psychological research data for reanalysis. American Psychologist, 61(7), 726–728. doi: 10.1037/0003‐066X.61.7.726

2

Powering Reproducible Research

Katherine S. Button and Marcus R. Munafò

Background

The widely used null hypothesis significance testing (NHST) framework grew out of the distinct statistical theories of Fisher (Fisher, 1955), and Neyman and Pearson (Rucci & Tweney, 1980). From Fisher, we take the concept of null hypothesis testing, and from Neyman‐Pearson the concepts of Type I (α) and Type II error (β) (see Chapter 3). Power is a concept arising from Neyman‐Pearson theory, and reflects the likelihood of correctly rejecting the null hypothesis (i.e., 1 − β). However, the hybrid statistical theory typically used leans most heavily on Fisher’s concept of null hypothesis testing (Vankov, Bowers, & Munafo, 2014). Sedlmeier and Gigerenzer (1989) argued that a lack of understanding of these critical distinctions partly explained the lack of consideration of statistical power in psychological science; while we (nominally, at least) adhere to a 5% Type I error rate, we in theory accept a Type II error rate of 20% (i.e., 80% power). More importantly, in practice, we seem to pay little attention to the Type II error rate, despite the need to consider both when evaluating whether a research finding is likely to be true (Button et al., 2013a, 2013b).
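To make these conventional error rates concrete, the short calculation below (not part of the original chapter) asks how many participants per group a two-sample t-test would need to reach the conventional 80% power at α = .05. The assumed standardized effect size of d = 0.5 is an arbitrary illustration, and the statsmodels package is used purely for convenience.

    # Hypothetical illustration: required sample size per group for 80% power
    # at alpha = .05, assuming a medium standardized effect (d = 0.5) in a
    # two-sided, two-sample t-test.
    from statsmodels.stats.power import TTestIndPower

    n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                              power=0.80, alternative='two-sided')
    print(round(n_per_group))  # roughly 64 participants per group

Under these assumptions, roughly 64 participants per group are required; smaller assumed effects drive this number up steeply.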

False Positives and False Negatives

As we have seen, within the hybrid NHST framework that continues to dominate within psychological science, the power of a statistical test is the probability that the test will correctly reject the null hypothesis when the null hypothesis is genuinely false (i.e., the probability of not committing a Type II error). Therefore, as statistical




power increases, the probability of committing a Type II error decreases. The probability of committing a Type II error is the false negative rate (β), and the power is equal to 1 − β. The power of a test is also known as the sensitivity. Critically, the power of a single statistical test is mathematically coupled to the size of the effect under investigation, the size of the sample, and the significance level (α). We can increase statistical power by increasing the sample size, increasing the effect size, or using a less stringent significance level. Unfortunately, statistical power often receives less consideration than the significance level. To some extent, this is reflected in the unequal weight typically placed on acceptable false positive (Type I) and false negative (Type II) rates; it is conventional (in theory, if not always in practice) to set the Type I error rate to 5% but the Type II error rate to 20%. This suggests that researchers may be more concerned about committing a Type I error (i.e., claiming that an effect exists when it does not) than about committing a Type II error. In fact, while researchers often adhere to a 5% Type I error rate, they often pay little attention to statistical power, and thus the Type II error rate may be much higher than the 20% level considered conventional. This is supported by evidence that, in many diverse fields, the average statistical power may be as low as 20% (i.e., a Type II error rate of 80%) (Button et al., 2013b). This framework appears to imply that changing power only influences the rate of false negative findings and has no impact on the chance of false positive results. However, when we consider false positive rates across a population of studies (i.e., an entire research literature), it becomes clear that the likelihood that a statistically significant finding is true depends critically on the average statistical power of studies within that literature. This point is best illustrated with an example. Suppose we work in a field where only 10% of the hypotheses we test are true, so that, in 90% of our studies, the null hypothesis is true. Let us also assume that the Type I and Type II error rates are set at conventional levels; in other words, the significance level is 5%, and the average power of the studies is 80%. If we conduct 1,000 studies with these parameters, of the 100 true associations that exist we will be able to detect 80 (as our power is 80%); of the remaining 900 null associations, we will falsely declare 45 (5%) as significant. Therefore, only 80/(80 + 45) = 64% of the findings that achieve statistical significance will actually reflect true effects. Now, suppose that the average power of our 1,000 studies is only 20%. We will likely detect 20 of the 100 true associations, but we will still falsely declare 45 (5%) of the remaining 900 null associations as significant. In this example, the likelihood that any single significant result is in fact true is only 20/(20 + 45) ≈ 31%. The proportion of positive results produced by a statistical test that are truly positive is formally referred to as the positive predictive value (PPV) of that test, that is, the post‐study probability that a research finding that has been claimed (e.g., based on achieving formal statistical significance) is in fact true (Ioannidis, 2005). As the preceding example illustrates, it is determined by the prevalence of true effects/hypotheses in the research field, and the sensitivity or power of the statistical test.
As illustrated in the preceding text, if the pre‐test prevalence of true effects in a


research field is low, and the power of the test is also low, then the PPV will be low. The formula linking the positive predictive value to power is:

\[
\mathrm{PPV} \;=\; \frac{(1-\beta)\,R}{(1-\beta)\,R + \alpha}
\]

Here, (1 − β) is the power, β is the Type II error, α is the Type I error, and R is the pre‐study odds (i.e., the odds that a probed effect is indeed non‐null among the effects being probed). In the preceding example, the pre‐study odds were 1/9 (i.e., 10 true associations for every 90 null associations) (Ioannidis, 2005). In summary, low statistical power reduces the chance that an individual study will detect the effect of interest (i.e., it increases Type II errors). However, perhaps less intuitively, when considering the research literature as a whole, a reliance on studies with low power greatly inflates the chances that any statistically significant finding is in fact false, in particular when the hypotheses being tested are unlikely (i.e., in exploratory research). In other words, most of the positive results from underpowered studies may be spurious (see Chapters 1 and 3).
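The formula can be checked in a few lines of code. The sketch below is not from the chapter; it simply re-computes the worked example above, assuming pre‐study odds R = 1/9, α = .05, and average power of either 80% or 20%.

    # Positive predictive value, PPV = (1 - beta) * R / ((1 - beta) * R + alpha),
    # evaluated for the hypothetical literature described in the text.
    def ppv(power, alpha, R):
        return power * R / (power * R + alpha)

    R = 1 / 9          # 10 true effects for every 90 null effects
    alpha = 0.05
    for power in (0.80, 0.20):
        print(f"power = {power:.0%}: PPV = {ppv(power, alpha, R):.0%}")
    # power = 80%: PPV = 64%
    # power = 20%: PPV = 31%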

Type S and Type M Errors

An inherent limitation of the NHST framework is that, in reality, no effects are truly null. Therefore, with a sufficiently large sample, it will inevitably be possible to reject the null hypothesis even when the difference is trivially small and therefore of no theoretical or practical importance. It has therefore been suggested (Gelman & Tuerlinckx, 2000) that a more useful conceptualization of the possible errors that can occur when interpreting data is to consider whether an estimate of an effect differs in magnitude from the true effect, or differs in sign from the true effect (rather than focusing on Type I and Type II errors). The former is a Type M error, and the latter a Type S error. Small studies with low power can, by definition, only reliably detect large effects. So even when an underpowered study discovers a true effect, it is likely that the estimate of the magnitude of that effect provided by that study will be exaggerated. This effect inflation is often referred to as the “winner’s curse” (Young, Ioannidis, & Al‐Ubaydi, 2008), and is likely to occur whenever claims of discovery are based on thresholds of statistical significance (e.g., p < .05).
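A short simulation makes the Type M error tangible. It is not taken from the chapter; the true standardized effect of d = 0.3, the group size of 20, and the number of simulated studies are arbitrary assumptions chosen so that power is low.

    # Simulate many underpowered two-group studies and look only at the
    # "significant" ones: their effect-size estimates are systematically inflated.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    d_true, n, n_studies = 0.3, 20, 20_000
    significant_d = []
    for _ in range(n_studies):
        treat = rng.normal(d_true, 1.0, n)
        control = rng.normal(0.0, 1.0, n)
        t, p = stats.ttest_ind(treat, control)
        if p < .05 and t > 0:  # a "discovery" in the expected direction
            pooled_sd = np.sqrt((treat.var(ddof=1) + control.var(ddof=1)) / 2)
            significant_d.append((treat.mean() - control.mean()) / pooled_sd)

    print("true effect:", d_true)
    print("mean estimate among significant studies:",
          round(float(np.mean(significant_d)), 2))
    # prints an average of roughly 0.7 to 0.8, far above the true effect of 0.3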

A theory may predict, for instance, that a specific training improves performance on a standardized learning test, so that the trained group has the higher population mean (HA: μ1 − μ2 > 0). Or, this theory may predict that an instructed learning group makes, on average, fewer mistakes (HA: μ1 − μ2 < 0). In the following, we assume the directional alternative hypothesis HA: μ1 − μ2 > 0. Now, let us assume that we draw random samples from the populations of interest (theoretically an infinite number of times) and conduct a significance test for each of these trials. If we then count the number of significant results for a given significance level α (e.g., 5%), we can draw either a correct or an incorrect conclusion, depending on whether the H0 or HA holds in the population. We then can distinguish between different errors. A Type I error occurs if the H0 is true (μ1 − μ2 = 0), but it is rejected on the basis of the significance test. The related error probability is denoted as α. The probability of correctly accepting the H0 is then 1 − α (i.e., the inverse probability). A Type II error occurs when we fail to reject the H0, although it is not true (μ1 − μ2 > 0). Its related error probability is denoted β. The probability of correctly rejecting the H0 under the assumption that the alternative hypothesis HA is true is then 1 − β. Table 3.1 illustrates the possible errors and their associated probabilities.

Table 3.1  Probabilities of correct and incorrect decisions.

  Statistical decision based on          Reality:                                  Reality:
  the significance test                  H0 is true                                H0 is false
  Reject H0                              Incorrect decision: Type I error (α)      Correct decision (1 − β)
  Accept H0                              Correct decision (1 − α)                  Incorrect decision: Type II error (β)

Now, let us turn again to the issue of statistical power, which is the main focus of this section. Statistical power depends on three things: the effect size in the population, the sample size (technically, the standard error, which is, besides the variance of the outcome variable in the population, a function of sample size), and the chosen significance level. Interested readers may refer to statistics textbooks that give more




detailed explanations of the dependency of power on these parameters. In general, the larger the population effect size, or the larger the sample size, or the higher the significance level, the greater the statistical power. For further clarification, consider the case of the one‐sided t‐test for independent samples, assuming equal but unknown population variances. The power is then defined as the probability of observing mean differences that are greater than the critical value wc when the expected values of populations 1 and 2 differ. Or, more formally expressed: power = P(T > wc | μ1 ≠ μ2). Referring to the example of a one‐sided independent t‐test (H0: μ1 = μ2 vs. HA: μ1 > μ2) and assuming equal, but unknown, population variances in populations 1 and 2, the power is then defined as:

\[
\mathrm{power} \;=\; P\!\left(\bar{x}_1 - \bar{x}_2 > w_c \;\middle|\; \mu_1 > \mu_2\right),
\qquad
\sigma^2_{\mathrm{diff}} \;=\; \sigma^2_{\mathrm{pooled}}\left(\frac{1}{n_1} + \frac{1}{n_2}\right),
\]

with μ1: population mean of group 1, μ2: population mean of group 2,

\[
\sigma_{\mathrm{pooled}} \;=\; \sqrt{\frac{(n_1 - 1)\,\sigma_1^2 + (n_2 - 1)\,\sigma_2^2}{n_1 + n_2 - 2}},
\]

σdiff: standard error of the mean difference, n1 and n2: sample sizes of group 1 and 2, respectively. For the sake of convenience and similarity to applied research, let us assume that a researcher has pre‐experimentally fixed a critical value wc, that is, set the significance level α to 5%, and defined a specific hypothesis about the population mean difference (called delta, δ) in a standardized learning test under the alternative hypothesis HA: μ1 − μ2 = 0.6. He or she thus states that the population mean difference is 0.6 standard deviations above the population mean difference under the null hypothesis, H0: μ1 − μ2 = 0. Statistical power then depends on two parameters: the difference between the population means μ1 and μ2 and the sample sizes, which influence the standard error of the mean difference σdiff, denoting the “typical” amount by which sample mean differences deviate from the population mean difference. Both properties are also illustrated in Figure 3.1, showing the sampling distributions of the mean differences under H0 and HA for the illustrative case of two populations differing in their mean scores on the hypothetical standardized learning test: The area under the left curve to the right of the critical value, labeled as α, refers to the Type I error probability (i.e., 5% in this example). The area under the right curve to the left of the critical value denotes the Type II error probability, labeled as β – that is, the probability of falsely not rejecting the null hypothesis when in fact the


[Figure 3.1 consists of three density plots of the sampling distribution of the mean difference w = x̄1 − x̄2 under H0 and HA. Panel (a): wc = 0.6, (μ1 − μ2)A = 0.7, α = 0.05, β = 0.39, 1 − β = 0.61. Panel (b): wc = 0.6, (μ1 − μ2)A = 1, α = 0.05, β = 0.13, 1 − β = 0.87. Panel (c): wc = 0.38, (μ1 − μ2)A = 0.7, α = 0.05, β = 0.08, 1 − β = 0.92.]

Figure 3.1  Illustration of parameters affecting power. (a) Power for population effect size δ = 0.7 and sample sizes of n1 = n2 = 20. (b) Effect of greater population effect size (δ = 1) on power. (c) Effect of increasing sample size to n1 = n2 = 50 on power.




alternative hypothesis is true. Hence, 1 − β, the reverse probability, then defines the light-gray area under the density curve of the sampling distribution under H1. This area under the curve defines the power, that is, the probability of correctly rejecting the null hypothesis (H0: μA = μB) when the specific alternative (H1: μA > μB) is true. As illustrated in Figure 3.1a, for a population effect size of δ = 0.7, sample sizes of n1 = n2 = 20, and a significance level of 0.05 (corresponding to a critical mean difference of wc = 0.60), the power is 0.61, and thus not enough to reliably reject the null hypothesis. Now, how can we improve the situation, that is, how can we increase power? As we can see from Figure 3.1b, a greater mean difference of δ = 1 in the population implies a higher power, resulting in an increase in power to 0.87. In fact, while this part of the illustration is of theoretical interest, it is far‐fetched from a practical point of view, just because we cannot manipulate population effect sizes. The only feasible method to increase power is to increase sample sizes. This example is shown in Figure 3.1c. By increasing the sample size from n1 = n2 = 20 to 50, the standard errors (i.e., the related standard deviations of both sampling distributions) decrease, and, thus, the light-gray area under the H1 distribution increases, thereby increasing power to 0.92. One should also notice that the reduced standard error of the mean difference also implies an increased precision of population mean difference estimates coming from different samples, as Figure 3.1c also illustrates. Note that these estimates are now grouped closer around the population mean difference of 0.70 than those in Figure 3.1a. It is thus less likely to observe mean differences that are far off the true population mean difference of 0.7, as is the case in Figure 3.1a. Thus, increasing sample size results in two desirable properties. On the one hand, power is increased, and the trustworthiness of the decision to reject or not reject the null hypothesis based on a significance test result is increased. On the other hand, the precision of population parameter estimates is increased because the standard error is reduced. One of the troubles with significance testing occurs when one tries to interpret a significant result from a study associated with low statistical power. Let us take the case of a researcher finding a significant mean difference (p < .05).
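The three scenarios of Figure 3.1 can be re-computed with a few lines of code. This sketch is not from the chapter; it assumes unit population variances and approximates the sampling distribution of the mean difference by a normal distribution, so its values come out close to, but not exactly equal to, the ones reported in the figure.

    # Power of the one-sided test with a fixed critical mean difference w_c,
    # under a normal approximation with population standard deviation sigma = 1.
    from math import sqrt
    from scipy.stats import norm

    def power_fixed_cutoff(delta, n1, n2, w_c, sigma=1.0):
        sigma_diff = sigma * sqrt(1 / n1 + 1 / n2)  # standard error of the mean difference
        return norm.sf((w_c - delta) / sigma_diff)  # P(observed difference > w_c | delta)

    print(power_fixed_cutoff(0.7, 20, 20, 0.60))  # panel (a): about 0.62
    print(power_fixed_cutoff(1.0, 20, 20, 0.60))  # panel (b): about 0.90
    print(power_fixed_cutoff(0.7, 50, 50, 0.38))  # panel (c): about 0.95

The ordering matches the figure: power rises with the population effect size and, more practically, with the sample size.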

Participants in Wason’s (1960) 2–4–6 task typically test a highly specific rule, such as xt+1 = 2 + xt, rather than the more general rule xt+1 > xt (i.e., “increasing order of magnitude”). This preference for highly restrictive, over‐determined rules (as illustrated in the inner circles of Figure 4.3) and the associated problem of over‐fitting (Wherry, 1975) are typical of scientific hypothesis testing. The specific model xt+1 = 2 + xt seems to be more representative

[Figure 4.3 depicts five concentric circles. From the innermost circle outward, the rules and their terror‐management counterparts are: xt+1 = 2 + xt (Mortality salience); xt+1 ∈ {increasing N} (Salience of existential topic); xt+1 ∈ {N} (Reminder of uncontrolled forces); xt+1 ∈ {any number} (Incompleteness); xt+1 ∈ {any symbol} (Mere priming of incompleteness).]

Figure 4.3  Impressive hits concerning a specific rule (inner circle) can prevent researchers from discovering that the data reflect a more general rule (outer circles).




(Kahneman & Tversky, 1972) of the 2–4–6 data than the less specific model xt+1 > xt or even more abstract models such as x ∈{N} (all natural numbers) or x ∈{any symbol} in the outer circles of Figure  4.3. The closer fit of over‐specified models raises a sense of precision. The specific model seems to account for more features of the data set than the abstract model. However, no attempt is typically made to falsify the claim that a constant quantitative increment of two is necessary to fulfill the rule. Hits (and hidden false negatives) resulting from over‐specified hypotheses  Theoretical fallacies such as this should not be treated lightly. The over‐specification tendency with which Wason (1960) was concerned is not just of historical significance. It can be found in many areas of behavioral science. For instance, with reference to terror management theory (Greenberg, Solomon, & Pyszczynski, 1997), Fiedler, Kutzner, and Krueger (2012) discussed the basic hypothesis that mortality salience (MS) causes a shift toward more conservative worldviews. The conservatism shift is thus attributed to a need for denegation of one’s own death by devotion to a form of culturally arranged immortality. Again, the assumption that there is something unique about mortality‐related stimuli sounds intuitively agreeable; ample evidence seems to support this focal hypothesis. Numerous studies demonstrated conservative reactions to diverse manipulations of mortality salience, suggesting a clear‐cut “hit.” However, does the primed concern with mortality really cause the conservative shift? Again, from the available hits in theory testing, one can conclude only that mortality salience is a sufficient condition. One cannot be sure that it is a necessary condition, unless one has ruled out a number of other less restrictive and simpler rules that include the mortality‐salience rule as a special case. Analogous to the h­ierarchy of decreasingly specific rules in the Wason (1960) task, mortality is a s­pecial case of existential values, or a special case of lack of power relative to nature, which is a special case of powerlessness in general, which is a special case of incompleteness (see gray‐colored text in Figure  4.3). To conclude that the conservative shift reflects the causal impact of mortality salience per se, it is necessary to exclude that these increasingly more general rules, which have nothing to do with mortality, can also account for conservatism (cf. Fiedler et al., 2012). In fact, Wicklund and Braun (1987) found that incompleteness priming alone (e.g., reminding people of their novice status or of an incomplete task) can induce a similar shift toward c­onservative values as does mortality. Again, rather than merely testing the sufficiency of the mortality‐salience hypothesis in many studies meant to establish a hit based on high statistical power, it is essential to also test for the necessity of mortality salience as a causal ingredient. Less specific or restrictive rules (e.g., any kind of incompleteness) may provide more parsimonious and more appropriate accounts. To be sure, the very manipulations that have typically been used to induce mortality salience – such as exposure to a funeral or writing about one’s own mortality – do not exclusively prime mortality. They can also be reframed as manipulations of other causal factors, such as incompleteness. 
The control conditions that have been used to set mortality salience apart from other negative affect – such as a scenario of going to a dentist – can hardly rule out


all these alternative explanations. Other possibilities have to be tested thoroughly, as recognized by proponents of terror‐management theory (e.g., Greenberg et  al., 1995). For instance, Echebarria‐Echabe (2013) recently tested an alternative account in terms of the broad concept of uncertainty. The variety of theoretical false negatives (unrecognized alternative hypotheses) behind the hit obtained in statistical tests of mortality‐salience can be very high. In any case, a hit in statistical tests of one focal hypothesis can imply a false positive at a more comprehensive theoretical level, along with one or more false negatives inherent in untested alternative hypotheses. Ambiguity of negative test results  An analogous point can be made for the allegedly unproblematic case of correct rejections. The idea of a correct rejection is that a negative test result (i.e., a failure to find support for H1) can be interpreted as a correct decision D+, implying that H0 is true. The beauty of such a rejection decision, provided that it is warranted, lies in its power to cross out from the list of possibilities an idea that apparently did not work. However, lack of support for a statistical hypothesis does not necessarily mean that the underlying theoretical hypothesis is wrong. It may alternatively reflect an inappropriate or sometimes even clumsy way of operationalizing and testing the theoretical hypothesis. This basic ambiguity of negative test outcomes places a heavy burden on the interpretation of correct rejections. Their (alleged) correctness depends on the degree to which a study meets the criteria of good science (viz., objectivity, reliability, validity, and effective research design). If these criteria are not met, the potential value of correct rejections is foregone. A good example of a premature rejection of an allegedly wrong hypothesis can be found in recent literature on the implicit‐association test (IAT; see Chapter 10). The failure to show that participants voluntarily control their IAT scores in a few studies (Banse, Seise, & Zerbes, 2001; Kim, 2003) led to the conclusion that faking on the IAT is impossible. However, closer inspection of the instructions used for these studies reveals that the faking task was not properly communicated to the participants. Kim1 (2003) merely instructed participants to avoid the impression of being prejudiced, and although Banse et al. (2001) provided explicit instructions to fake a positive attitude toward homosexuals, they did not provide any information about how response latencies affect IAT results. One decade later, what appeared to represent a correct rejection has turned into clear‐cut false negative. Hardly anybody would contest the assertion that even minimal prior experience with the IAT and how it works is sufficient to enable participants to control and even reverse their IAT scores (Fiedler & Bluemke, 2005; Röhner, Schröder‐Abé, & Schütz, 2011). “Logically, null findings in a few studies cannot prove the impossibility of manipulating IAT scores substantially, however, the mere existence of successful faking under some conditions is sufficient to prove its possibility” (Fiedler & Bluemke, 2005, p. 308). The lesson learned here is straightforward. A correct rejection, based on any statistical threshold of acceptance, can by definition only be as good as the operationalization of the studied variables. A failure to manipulate an independent




variable effectively, just as relying on an invalid or unreliable measure of the dependent variable, necessarily reduces the validity of a correct rejection. We have to keep this principle in mind and refrain from discarding theoretical hypotheses when negative results are peculiar to questionable manipulations or measurement procedures. The strength of a correct rejection is thus contingent on the proper operationalization of psychological constructs. Otherwise, a wrongly rejected h­ypothesis may hinder scientific progress.1

Seemingly unwanted outcomes: false positives and false negatives While hits and correct rejections can mislead researchers into resting on their l­ aurels and refraining from more systematic tests of sufficient and necessary conditions, the seemingly unwanted cases of false positives and false negatives are also ambivalent. They come along with both costs and benefits. We have already seen that empirical outcomes are not only hits or only correct rejections. What constitutes a hit with reference to a statistical hypothesis may represent a false positive concerning the underlying theoretical hypothesis and a false negative with respect to another causal factor that actually underlies the data. By analogy, what appears to be an erroneous outcome – a false positive or a false negative – may at the same time instigate new research, leading to more refined theory tests and new insights. Yet, although both types of erroneous outcomes can foster scientific progress, it seems important to note a fundamental source of asymmetry between false positives and false negatives. There are good reasons to assume that false positives are, in  some respects, less costly and less irreversible than false negatives. How consequential are false positives and false negatives?  One notable asset of false positives, apart from their major deficit of being mistakes, is that they tend to be published and to make an error transparent. In contrast, the greatest disadvantage and the most dangerous property of false negatives is that they go unpublished and unrecognized, disappearing in the file‐drawer. Whereas false positives are visible mistakes that can be tackled and corrected in upfront and transparent debates that instigate scientific progress, false negatives are much less likely to be detected, corrected, and harnessed for scientific progress (Denrell, 2005, Denrell & Le Mens, 2012). Statistically, a false positive is no more and no less than an outlier in a sampling distribution. Given an error probability of α, a study sample exhibits a significant effect that does not exist in the underlying population at a rate of α. However, statistical α errors per se tell us little about the costs and benefits of evidence‐based decisions. There is little reason to assume that committing or publishing a false positive in a study is particularly expensive or irresponsible (cf. Simmons, Nelson, & Simonsohn, 2011). Whether it is wise to set α to a more conservative level in applied research (e.g., in medicine) depends on the relative costs and benefits of the decisions informed by false positives and false negatives. False positives typically lead to non‐beneficial or costly interventions, whereas false negatives lead to failures to


recognize significant side effects of medical or pharmacological interventions. Which one of both unwanted outcomes is worse cannot be derived from inherent differences of false positives or false negatives, but depends on a rational analysis of the relative costs and benefits. Importantly, such a cost–benefit analysis can only be meaningfully applied to theoretical inferences and practical interventions, and not to statistical samples observed in specific studies. Vicissitudes of  distinguishing wrong and  valid findings  However, it can be very difficult to determine the utility of practical decisions, especially when the categorization of a decision as correct or false is scientifically unclear. Let us again relate this important issue to a prominent research topic. Imagine that two new studies in economic psychology have demonstrated that investment bankers fall prey to a generalized overconfidence illusion (Gigerenzer, Hoffrage, & Kleinbölting, 1991)  –  a phenomenon covered essentially in all textbooks in decision research, economics, social psychology, and consumer science. When presented with a list of binary knowledge questions (e.g., “Which one of two stocks has been more successful in 2013, X or Y?”), the actual percentage of correct responses is lower than the corresponding subjective‐confidence estimates in percentage. Imagine that one study, working with tricky decision items from the global economy, yields a significant overconfidence bias in a sufficiently large sample of, say, 200 bankers. Imagine that the other study, based on a random sample drawn from a universe of all pairwise items in a geographic knowledge domain, finds much weaker evidence for overconfidence. In a sample of 100 bankers, the illusion was only significant at a moderate level of α = 0.05, rendering the chance of a false positive more likely. Based on these findings, the authors conclude that strong support for bankers’ overconfidence has been found for tricky economic problems – consistent with the possibility that domain‐specific overconfidence may have facilitated the financial crisis. What does such a statistical result reveal about the underlying theoretical and practical issues, the actual probability or disutility of overconfidence illusions? Some statisticians would complain that the α rates ought to be determined a priori, before the results are apparent (without offering rational norms for α setting). Others would explain that one should not pit significant against non‐significant results; one should instead try to test for an interaction, showing that overconfidence was significantly stronger in one study than in the other. Still others, who are concerned with issues of research design, would argue that one should refrain from comparing two studies that used different materials and content domains. However, all these typical comments concerning the validity of an overconfidence result would miss a crucial point, namely that only participants have been treated as a random factor, whereas task items have been selected in a pragmatic fashion. One memorable finding across several decades of pertinent research is that strong and robust overconfidence illusions are obtained only when the selection of items is based on researchers’ intuition rather than on randomized sampling from a clearly determined domain (Gigerenzer et al., 1991; Juslin, 1994). That is, only when the researchers’ own intuitive feeling for “suitable items” is at work is overconfidence




obtained regularly. The illusion vanishes, or is greatly reduced, when items are treated as a random factor. One might therefore suspect that the discrepant results of the two depicted studies might be due to the fact that the former study used selectively tricky economy item, whereas the second study used randomly selected items. Consequently, what appeared to be a valid result (hit) may actually be much more likely to reflect a false positive as soon as the statistical notion of random sampling is extended from participants to items. Conversely, the ultimate insight achieved by Gigerenzer et al. (1991) and Juslin (1994) consisted in their overcoming a long‐lasting false negative  –  namely, the failure to take stimulus sampling p­rocedures into account as a major determinant of overconfidence. Indeed, the classification of a finding as false positive depends on a number of decisions and restrictions that lie outside the statistician’s realm. Should only the selection of participants be randomized? Or, should tasks or items be also randomized? How about instructions? Task settings? Or what about social and cultural contexts and many other facets that ought to be randomized in a truly r­epresentative design (Brunswik, 1955; Dhami, Hertwig, & Hoffrage, 2004)? Because it is virtually impossible to realize a fully representative design that treats all these facets as random factors, it follows that one cannot determine an effective α rate. All that statisticians can do is to suggest simplified rules to estimate α in an impoverished design, in which most independent variables are treated as fixed‐effects factors with only two or very few arbitrarily selected levels. Given all this indeterminacy, the air of precision that surrounds the concept of α seems to reflect a cardinal illusion in the current methodological debate. Even very large sample sizes and repeated replications intended to optimize α control seem to not prevent researchers from committing serious errors at the level of theorizing that – once recognized – render perfect α control futile. Pertinent research examples  Consider another really enlightening finding in the overconfidence realm, which highlights the importance of long‐ignored false negatives and, at the same time, sheds an embarrassing light on several decades of painstaking research guided by careful statistical hypothesis testing. Erev, Wallsten, and Budescu (1994) found that overconfidence reflects, to a large extent, a truism. Given that subjective confidence and accuracy are less than perfectly correlated, plotting one variable as a function of the other must inevitably cause regression to the mean. Thus, accuracy scores must be regressive when plotted as a function of given values in subjective confidence, as usually done in overconfidence research. Accuracy scores that correspond to high confidence scores must be less extreme. However, when the same imperfect correlation is analyzed the other way around, plotting confidence as a function of given values of accuracy, high accuracy scores now regress to less extreme confidence scores. Thus, Erev et al. (1994) showed that both over‐ and under‐confidence can be found in the same data. The point here is that three decades of studies aimed at establishing an ill‐understood law at an impressive level of significance and effect size can be cast into doubt as soon as a simple logical rule is taken into account. The scientific value of the insight gained


from Erev et al. (1994) was independent of samples size and α; it could have been discovered in a seriously underpowered study or even with no study at all! The second example refers to evolutionary conceptions, which are readily accepted by the community of behavioral scientists, at times leading to uncritical  –  though often highly significant – hypothesis tests. In such cases, again, progress in science depends on deliberate attempts to overcome theoretical false negatives, as nicely illustrated in a paper published by Bell and Buchner (2012). With reference to the critical role of cheater‐detection for the evolution of social rationality (Cosmides & Tooby, 2004; Gigerenzer & Hug, 1992), these authors were interested in testing the prediction of enhanced source memory for faces encoded in the context of a cheater‐ detection vignette. Indeed, a first series of experiments provided strong support. The context of faces associated with cheating was memorized much better c­ompared with the context of neutral control faces or faces associated with cooperation. Nevertheless, another study revealed that faces associated with disgust had a similar source memory advantage, suggesting an alternative explanation in terms of a so‐ far‐ignored (theoretical) false negative. Maybe other negative emotions that are unrelated to detecting cheaters might exert a similar benevolent influence on memory. Finally, Bell and Buchner (2012) reported evidence for a reverse source memory advantage for faces associated with positive themes (cooperation) when the base‐rate of cheating was high rather than low, suggesting an even broader alternative account in terms of expectancy violation. Several scholars have shown, impressively, that the history of science can be framed as a progressive sequence of false positive errors (Kuhn, 1962) giving way to new insights growing out of so‐far‐neglected false negatives. Today’s state of the art in science turns out to be a false positive tomorrow, when more advanced models or methods – and discoveries of false negatives – reveal that earlier conclusions must be revised and corrected. As a creative side effect of this continual updating process that turns apparent hits into (theoretical) false positives and false negatives into hits, a plethora of novel and theoretically enlightening findings arise from research that may be guided by unwarranted or merely transitory hypotheses. The failure to understand and adequately quantify the overconfidence illusion did not prevent researchers from making discoveries of considerable practical value. For instance, overconfidence research has revealed that experts may be more prone to overconfidence than laypeople (Oskamp, 1965); overconfidence is sensitive to sample size (Moore & Healy, 2008); and overconfidence effects are particularly strong for interval construction tasks (estimating the lower and upper limits of an interval that includes a correct estimate with a certain probability), but much weaker for interval estimation (estimating the probability corresponding to a given interval; cf. Juslin, Winman, & Hansson, 2007). It is characteristic of science that many valuable incidental outcomes are not contingent on the ultimate truth of the guiding hypotheses. However, in this process, novel insights and “scientific revolutions” (Kuhn, 1962) are rarely obtained by applying stricter criteria for statistical hypothesis testing. 
They are typically brought about by critical dissenters and members of scientific minorities who dare to investigate those hidden alternative ideas that we have called theoretical false negatives.
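The regression argument of Erev, Wallsten, and Budescu (1994) discussed above can also be demonstrated in a few lines. The simulation below is not from the chapter; the correlation of .5 between confidence and accuracy and the cut-off of one standard deviation are arbitrary assumptions.

    # When two standardized variables correlate imperfectly, each looks "too low"
    # when conditioned on high values of the other: apparent overconfidence and
    # apparent underconfidence coexist in the very same data.
    import numpy as np

    rng = np.random.default_rng(1)
    n, r = 100_000, 0.5
    confidence = rng.standard_normal(n)
    accuracy = r * confidence + np.sqrt(1 - r ** 2) * rng.standard_normal(n)

    high_conf = confidence > 1.0
    high_acc = accuracy > 1.0
    print(round(float(confidence[high_conf].mean()), 2),  # about 1.5
          round(float(accuracy[high_conf].mean()), 2))    # about 0.75: looks like overconfidence
    print(round(float(accuracy[high_acc].mean()), 2),     # about 1.5
          round(float(confidence[high_acc].mean()), 2))   # about 0.75: looks like underconfidence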




There are more reasons why the likelihood and importance of false negatives tend to be underestimated, compared to the often overstated importance and alleged costs of false positives. Even at the level of statistical significance testing, there is little evidence for the claim that researchers optimally exploit their researcher degrees of freedom (Simmons et  al., 2011), resulting in inflated indices of significance. On the contrary, we would like to point out that behavioral scientists are often remarkably inefficient in optimizing the power of studies. They often run inappropriate overall analyses of variance rather than testing their specific hypotheses at much higher power in theoretically appropriate c­ontrast analyses (Rosenthal, Rosnow, & Rubin, 2000). They often fail to control for and optimize the reliability of their dependent measures (Schmidt, 2010). They often average away the strongest and most interesting results by computing participant‐level indices rather than considering within‐participant results across trials (Gilden, 2001). They hardly consider monotonic transformations when there is no reason to take the given scale of raw data for granted. Last but not least, their cardinal mistake is to run underpowered studies that greatly reduce the chances of obtaining significant findings for potentially valid hypotheses in which false negatives have been identified as a major enemy of efficient science (Cohen, 1992). However, compared with all of these biases against false negatives at the level of statistical significance testing, the biases can be assumed to be much stronger at the theoretical level. Whoever has tried to publish an alternative hypothesis that entails a challenge to an already established hypothesis will presumably agree that the scientific system is strongly biased in favor of mainstream and against theoretical dissenters who want to make a strong point for a so‐far‐undiscovered false negative position (see Chapter  17). The peer‐review process  –  of journals, selection committees, and funding agencies – will typically give more credence to mainstream positions, placing the burden of proof on proponents of new hypotheses. As a consequence of this conservative bias, the chances are high that a large number of studies are designed to gather strong support and high citation rates in favor of a few consensually preferred hypotheses.2 Due to the high meta‐analytic power of such mainstream positions, even the weakest effects become easily significant, and therefore statistical false positives hardly constitute a severe problem. However, because significant (mainstream) positions identify only sufficient causes but rarely necessary causes, the real problem lies in the neglect of theoretical false negatives. Ironically, Bayesian inference tools (Griffiths, Tenenbaum, & Kemp, 2012) – which have been suggested to improve the logic of science – will further reinforce the bias in favor of the mainstream, because the prior odds will favor the one of two equally correct hypotheses that has been the focus of more empirical tests (Fiedler, Walther, & Nickel, 1999). However, as long as the allegedly stronger support for the more frequently tested hypothesis does not reflect an empirical outcome but a collective bias or fashion within the scientific community, one should refrain from drawing unwarranted inferences on labile Bayesian ground (for a telling example, see Ullrich, Krueger, Brod, & Groschupf, 2013).


Concluding Remarks

The terms “hits,” “false positives,” “correct rejections,” and “false negatives,” which are commonly used to denote the outcomes of scientific hypothesis testing, are borrowed from signal‐detection theory (Swets, 1996). However, researchers, reviewers, and editors hardly ever engage in the kind of cost–benefit analysis that this statistical‐decision theory requires. The central message conveyed in the present chapter is that no outcome of hypothesis testing exclusively entails costs or benefits. Behind every hit there are potential false positives, and a hit in a statistical test may well conceal important theoretical false negatives. An unambiguous classification of research findings as hits, correct rejections, false positives, and false negatives is possible only at the superficial level of statistical hypotheses in singular studies. It is always ambiguous, though, at the level of theoretical hypotheses, suggesting decisions that vary in costs and benefits. Such a consideration of costs and benefits – which should go beyond the value of getting a paper rejected or published, respectively – is essential to the evaluation of any scientifically based action or decision. Throughout this chapter, we have repeatedly distinguished between basic and applied research. Most examples used to illustrate that false negatives can be equally important or even more important than false positives came from basic rather than applied research. However, this does not mean that, in applied domains, false negatives are negligible – for instance, when medical interventions call for a particularly high degree of responsibility. An inappropriate intervention can be motivated not only by false positive evidence on the intervention’s seeming advantage, but also by a false negative failure to detect its disadvantage or another intervention’s superiority. Therefore, the distinction of “positive” or “negative” research findings, which may be subject to arbitrary framing, must not be confused with the distinction of committed and omitted interventions – and both of these distinctions are not per se correlated with cost–benefit relations. To be sure, the costs of false positives, or errors of commission, typically stand alone, pointing to a clearly identifiable mistaken finding that is to blame for a wrong intervention. In contrast, the costs of false negatives are often invisible and indirect, because an overlooked finding cannot be blamed for a wrong action or decision. However, as the goal of responsible applied research is to identify and implement superior interventions, rather than only avoiding mistakes one can be blamed for, the need to critically assess false negatives (i.e., failures to look out for better methods) is in principle as important as the need to control for false positives (erroneous evidence for a seemingly appropriate method). Indeed, the costs of false negatives in applied domains may accumulate and worsen over time, when they prevent scientists from discovering more beneficial and less costly alternatives. The entire debate about the quality and usability of behavioral science is of little value, and is actually counterproductive, if the utilitarian analysis of costs and benefits is confounded with the statistical rhetoric of false positives and false negatives.




A glance at the history of science also reveals that false negatives often constitute a similarly serious problem as false positives. Although in applied research, especially in medicine, false positives can have disastrous consequences, it is easier in basic research to detect and correct transparently published false positives than false  negatives that were never noticed or vanished in the file drawer. Scientific progress often consists of the discovery that seeming hits are false positives, reflecting only transitory solutions, and new insights are gained from the discovery of precious false negatives. How hard is it to publish a debate, a theoretical argument against intriguing popular findings, or an unfamiliar non‐mainstream theory? And how easy is it, by comparison, to publish even shallow mainstream ideas, provided they adhere to superficial statistical norms? Thus, we believe what is needed for progress in “psychological science under scrutiny” is not so much better statistics aiming at alpha‐control but deeper and stricter theorizing.

Acknowledgment

The research underlying this chapter was supported by a Koselleck grant from the Deutsche Forschungsgemeinschaft (Fi 294/23‐1).

Endnotes

1  Kim (2003) included another experimental condition. Participants were explicitly instructed to react more slowly on congruent trials and more quickly on incongruent trials to reduce their IAT scores. As the former was more easily accomplished than the latter, Kim concluded that they had only been partly able to fake their IAT scores, in a way that, in Kim’s opinion, was likely to be identified by careful examination of the data.
2  Such conservatism might be appropriate, to be sure, if many research findings are false.

References

Banse, R., Seise, J., & Zerbes, N. (2001). Implicit attitudes toward homosexuality: Reliability, validity, and controllability of the IAT. Zeitschrift für Experimentelle Psychologie, 48, 145–160.
Bell, R., & Buchner, A. (2012). How adaptive is memory for cheaters? Current Directions in Psychological Science, 21(6), 403–408. doi: 10.1177/0963721412458525
Brunswik, E. (1955). Representative design and probabilistic theory in a functional psychology. Psychological Review, 62, 193–217.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159. doi: 10.1037/0033‐2909.112.1.155
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.

A reverse inference is invalid inasmuch as the symptom is more common than the disease, that is, p(Q) > p(P). In our medical example, this can happen under two common conditions. First, think of the rash inspection as a test of Lyme disease. The validity of a test is expressed by its sensitivity and its specificity. Sensitivity is the probability that a person with Lyme has a rash, p(Q|P); specificity is the probability that a person without Lyme does not have a rash, p(~Q|~P). A test is highly valid (and a symptom is a good diagnostic sign) inasmuch as both sensitivity and specificity are high. Yet, if the base rate probability of the disease is low, that is, if p(P) is small, the symptom will typically be more common than the disease, p(Q) > p(P), and the stage for invalid reverse inferences is set. Second, the same inequality occurs if the sensitivity of the test is greater than its specificity. Many symptoms are nonspecific; they can occur for a number of reasons. Yet, many diagnosticians prefer highly sensitive tests because they would rather call a healthy person ill than overlook a disease. This strategy is not outright irrational. It is an example of judicious “error management” (Haselton et al., 2009; Pascal, 1670/1910). In the context of many diseases, physicians assume that the error of missing a sick person is more costly than the error of missing a healthy one. They prefer to err, as it were, on the safe side. The most prominent attempt to improve probabilistic decision‐making and to combat the reverse inference fallacy consists of presenting frequency data instead of probabilities (Gigerenzer, Gaissmaier, Kurz‐Milcke, Schwarz, & Woloshin, 2007). Crossing the state of nature (e.g., the disease is present or not) with the state of the signal (the symptom is present or not; the test result is positive or negative) provides four frequencies. When these frequencies are displayed in a 2 × 2 table, the decision‐maker can calculate any conditional probability with ease, and advocates of this method can point to successes in the training of professionals (e.g., in the medical field). Thinking in terms of frequencies is easier than thinking in terms of probabilities. Natural and cultural evolution have favored the former, which maps more closely onto how we encounter events in our typical environments.
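The frequency format advocated by Gigerenzer and colleagues is easy to emulate. The sketch below is not from the chapter, and the base rate, sensitivity, and specificity are invented numbers used only to show how sharply the forward inference p(Q|P) and the reverse inference p(P|Q) can diverge when the disease is rare.

    # Natural frequencies for a hypothetical symptom-and-disease example.
    population = 10_000
    base_rate = 0.01      # p(P): prevalence of the disease (assumed)
    sensitivity = 0.90    # p(Q|P): diseased people who show the symptom (assumed)
    specificity = 0.90    # p(~Q|~P): healthy people without the symptom (assumed)

    diseased = population * base_rate                 # 100 people
    healthy = population - diseased                   # 9,900 people
    symptom_and_disease = diseased * sensitivity      # 90 people
    symptom_no_disease = healthy * (1 - specificity)  # 990 people

    forward = symptom_and_disease / diseased                                    # p(Q|P) = 0.90
    reverse = symptom_and_disease / (symptom_and_disease + symptom_no_disease)  # p(P|Q) is about 0.08
    print(forward, reverse)

With these assumed numbers, barely one symptomatic person in twelve actually has the disease, even though the test detects 90% of true cases.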

Inferences in Science

Most sciences distinguish data that allow causal inference from data that do not. As is well known, correlation is not sufficient for causation. If two variables X and Y are correlated with each other, X may be the cause of Y, Y may be the cause of X, or a third variable Z may be the cause of both X and Y. It falls to experiments, in which X is manipulated, to show whether there is a causal effect on Y. The confusion of correlation with causation is a reverse inference problem. In the bivariate case, it is true that if there is causation, there is correlation; but it is not true that if there is correlation, there is causation.4 Psychologists design experiments to make forward causal inferences. They set up hypotheses of the type “if certain experimental conditions P are met, certain effects Q will be observed.” They think of their experiments as modus ponens tests under uncertainty. The typical two‐condition experiment has the same number of participants in the treatment and in the control condition (i.e., p[P] = 0.5). Forward and reverse inferences are calculated from the set of four joint frequencies (P and Q, P and ~Q, ~P and Q, ~P and ~Q). For rare behaviors (p[Q] < .5), reverse inferences are stronger than forward inferences (p[P|Q] > p[Q|P]); for common behaviors, they are weaker. Neither experimenters nor their readers care much about reverse inferences within the study sample. They want to infer the causes of behavior when they




observe that behavior “in the wild.” For example, research shows that individuals with low self‐esteem are more likely to derogate members of out‐groups after their ego is bruised than are individuals whose self‐esteem is high or who are not threatened (Crocker, Thompson, McGraw, & Ingerman, 1987). This is a valuable forward inference. Notice, however, that the base rate probability of the critical condition is low (p[P] is small), and out‐group derogation has many potential causes besides a bruised ego, each of which makes the effect more likely in its presence than in its absence, p(Q|P) > p(Q|~P). As more causes are added to the list, estimates of the overall probability of the effect, p(Q), go up inasmuch as the various causes are independent of one another. As p(Q) becomes larger than any individual p(P) [here we go again], the discrepancy between the forward inference p(Q|P) and the reverse inference p(P|Q) also increases; therefore, the error becomes greater when the reverse inference is thought to mirror the forward inference. This is not a good state of affairs. As many potential causes line up, the effect becomes overdetermined, which undermines reverse inferences (see Krueger, 1998, for an example from social cognition). Each individual cause is progressively discounted (Kelley, 1972; Krueger, 2009).

Significance testing

Although a movement to reform psychological statistics is underway (Cumming, 2014), experimental psychologists still rely on null hypothesis significance testing to support forward inferences at the conceptual level (Krueger, 2001; Lambdin, 2012). They assume that the data (i.e., the particular numerical results they obtain, be they frequencies, percentages, or central tendencies, among others) will be improbable under the null hypothesis if the substantive research hypothesis is true. This conceptualization is paradoxical because, at the statistical level, the logic of significance testing is a probabilistic version of modus tollens (Cohen, 1994). When researchers compute a p-value, they estimate the probability of the observed data – or data even more different from the null hypothesis – under the assumption that the null hypothesis is in fact true. The null hypothesis refers to the idea that “nothing is going on” (Dawes, 1991), which typically means that behavior does not differ between the treatment and the control conditions. Note that the statistical inference moves from the null hypothesis to the data. Yet, when p is small (< .05, say), researchers reject the null hypothesis, an inference that runs in the reverse direction, from the data to the hypothesis. By one conventional reading, p > 0.1 yields “little or no real evidence against the null hypothesis”; 0.05