Advances in Experimental Political Science (ISBN 9781108478502)


English; 670 [671] pages; 2021


Table of contents:
Contents
Tables
Figures
Boxes
Contributors
Acknowledgments
1. A New Era of Experimental Political Science
Part 1: Experimental Designs
2. Conjoint Survey Experiments
3. Audit Studies in Political Science
4. Field Experiments with Survey Outcomes
5. How to Tame Lab-in-the-Field Experiments
6. Natural Experiments
7. Virtual Consent: The Bronze Standard for Experimental Ethics
Part 2: Experimental Data
8. Experiments, Political Elites, and Political Institutions
9. Convenience Samples in Political Science Experiments
10. Experiments Using Social Media Data
11. How to Form Organizational Partnerships to Run Experiments
Part 3: Experimental Treatments and Measures
12. Improving Experimental Treatments in Political Science
13. Beyond Attitudes: Incorporating Measures of Behavior in Survey Experiments
Part 4: Experimental Analysis and Presentation
14. Advances in Experimental Mediation Analysis
15. Subgroup Analysis: Pitfalls, Promise, and Honesty
16. Spillover Effects in Experimental Data
17. Visualize as You Randomize: Design-Based Statistical Graphs for Randomized Experiments
Part 5: Experimental Reliability and Generalizability
18. Transparency in Experimental Research
19. Threats to the Scientific Credibility of Experiments: Publication Bias and P-Hacking
20. What Can Multi-Method Research Add to Experiments?
21. Generalizing Experimental Results
22. Conducting Experiments in Multiple Contexts
Part 6: Using Experiments to Study Identity
23. Identity Experiments: Design Challenges and Opportunities for Studying Race and Ethnic Politics
24. The Evolution of Experiments on Racial Priming
25. The Evolution of Experiments on Gender in Elections
26. Gender Experiments in Comparative Politics
Part 7: Using Experiments to Study Government Actions
27. Experiments on and with Street-Level Bureaucrats
28. The State of Experimental Research on Corruption Control
29. Experiments on Political Activity Governments Want to Keep Hidden
30. Experiments in Post-Conflict Contexts
31. Experiments on Problems of Climate Change
32. A Constant Obsession with Explanation
Author Index
Subject Index


Advances in Experimental Political Science

Experimental political science has changed. In two short decades, it evolved from an emergent method to an accepted method to a primary method. The challenge now is to ensure that experimentalists design sound studies and implement them in ways that illuminate cause and effect. Ethical boundaries must also be respected, results interpreted in a transparent manner, and data and research materials must be shared to ensure others can build on what has been learned. This book explores the application of new designs; the introduction of novel data sources, measurement approaches, and statistical methods; the use of experiments in more substantive domains; and discipline-wide discussions about the robustness, generalizability, and ethics of experiments in political science. By exploring these novel opportunities while also highlighting the concomitant challenges, this volume enables scholars and practitioners to conduct high-quality experiments that will make key contributions to knowledge.

James N. Druckman is the Payson S. Wild Professor of Political Science at Northwestern University. He was elected to the American Academy of Arts and Sciences and, with Donald Green, helped found the Experimental Research section of the American Political Science Association. He also is currently the co-Principal Investigator for Time-sharing Experiments for the Social Sciences, and co-authored the book Who Governs? Presidents, Public Opinion, and Manipulation.

Donald P. Green is the J.W. Burgess Professor of Political Science at Columbia University. He was elected to the American Academy of Arts and Sciences and, with James Druckman, helped found the Experimental Research section of the American Political Science Association. He also co-founded the scholarly consortium of experimental researchers, Evidence in Governance and Politics, and co-authored the textbook Field Experiments: Design, Analysis, and Interpretation.

Advances in Experimental Political Science

Edited by

JAMES N. DRUCKMAN Northwestern University

DONALD P. GREEN Columbia University

University Printing House, Cambridge cb2 8bs, United Kingdom One Liberty Plaza, 20th Floor, New York, ny 10006, usa 477 Williamstown Road, Port Melbourne, vic 3207, Australia 314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India 79 Anson Road, #06–04/06, Singapore 079906 Cambridge University Press is part of the University of Cambridge. It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence. www.cambridge.org Information on this title: www.cambridge.org/9781108478502 doi: 10.1017/9781108777919 © Cambridge University Press 2021 This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press. First published 2021 Printed in the United Kingdom by TJ Books Limited. Padstow, Cornwall A catalogue record for this publication is available from the British Library. Library of Congress Cataloging-in-Publication Data names: Druckman, James N., 1971– editor. | Green, Donald P., 1961– editor. title: Advances in experimental political science / edited by James N. Druckman, Donald P. Green. description: Cambridge, United Kingdom ; New York, NY : Cambridge University Press, 2021. | Includes bibliographical references and index. identifiers: lccn 2020022763 (print) | lccn 2020022764 (ebook) | isbn 9781108478502 (hardback) | isbn 9781108745888 (paperback) | isbn 9781108777919 (epub) subjects: lcsh: Political science–Methodology. | Political science–Research. | Political science–Experiments. classification: lcc ja71 .A388 2021 (print) | lcc ja71 (ebook) | ddc 320.072/4–dc23 LC record available at https://lccn.loc.gov/2020022763 LC ebook record available at https://lccn.loc.gov/2020022764 isbn 978-1-108-47850-2 Hardback isbn 978-1-108-74588-8 Paperback Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

Contents

List of Tables
List of Figures
List of Boxes
Contributors
Acknowledgments

1 A New Era of Experimental Political Science (James N. Druckman and Donald P. Green)

Part I: Experimental Designs
2 Conjoint Survey Experiments (Kirk Bansak, Jens Hainmueller, Daniel J. Hopkins, and Teppei Yamamoto)
3 Audit Studies in Political Science (Daniel M. Butler and Charles Crabtree)
4 Field Experiments with Survey Outcomes (Joshua L. Kalla, David E. Broockman, and Jasjeet S. Sekhon)
5 How to Tame Lab-in-the-Field Experiments (Catherine Eckel and Natalia Candelo Londono)
6 Natural Experiments (Rocío Titiunik)
7 Virtual Consent: The Bronze Standard for Experimental Ethics (Dawn Langan Teele)

Part II: Experimental Data
8 Experiments, Political Elites, and Political Institutions (Christian R. Grose)
9 Convenience Samples in Political Science Experiments (Yanna Krupnikov, H. Hannah Nam, and Hillary Style)
10 Experiments Using Social Media Data (Andrew M. Guess)
11 How to Form Organizational Partnerships to Run Experiments (Adam Seth Levine)

Part III: Experimental Treatments and Measures
12 Improving Experimental Treatments in Political Science (Diana C. Mutz)
13 Beyond Attitudes: Incorporating Measures of Behavior in Survey Experiments (Erik Peterson, Sean J. Westwood, and Shanto Iyengar)

Part IV: Experimental Analysis and Presentation
14 Advances in Experimental Mediation Analysis (Adam N. Glynn)
15 Subgroup Analysis: Pitfalls, Promise, and Honesty (Marc Ratkovic)
16 Spillover Effects in Experimental Data (Peter M. Aronow, Dean Eckles, Cyrus Samii, and Stephanie Zonszein)
17 Visualize as You Randomize: Design-Based Statistical Graphs for Randomized Experiments (Alexander Coppock)

Part V: Experimental Reliability and Generalizability
18 Transparency in Experimental Research (Cheryl Boudreau)
19 Threats to the Scientific Credibility of Experiments: Publication Bias and P-Hacking (Neil Malhotra)
20 What Can Multi-Method Research Add to Experiments? (Jason Seawright)
21 Generalizing Experimental Results (Erin Hartman)
22 Conducting Experiments in Multiple Contexts (Graeme Blair and Gwyneth McClendon)

Part VI: Using Experiments to Study Identity
23 Identity Experiments: Design Challenges and Opportunities for Studying Race and Ethnic Politics (Amber D. Spry)
24 The Evolution of Experiments on Racial Priming (Ali A. Valenzuela and Tyler Reny)
25 The Evolution of Experiments on Gender in Elections (Samara Klar and Elizabeth Schmitt)
26 Gender Experiments in Comparative Politics (Amanda Clayton and Georgia Anderson-Nilsson)

Part VII: Using Experiments to Study Government Actions
27 Experiments on and with Street-Level Bureaucrats (Noah L. Nathan and Ariel White)
28 The State of Experimental Research on Corruption Control (Paul Lagunes and Brigitte Seim)
29 Experiments on Political Activity Governments Want to Keep Hidden (Jennifer Pan)
30 Experiments in Post-Conflict Contexts (Aila M. Matanock)
31 Experiments on Problems of Climate Change (Mary C. McGrath)
32 A Constant Obsession with Explanation (Lynn Vavreck)

Author Index
Subject Index

Tables

2.1 The list of possible attribute values in the Democratic primary experiment.
2.2 Topical classification of the 124 published articles using conjoint designs identified in our literature review for the years 2014–2019.
4.1 Potential benefits of and complementarities between four methodological practices.
4.2 Notation and values used in the examples.
4.3 Variances and variable costs of alternative designs.
6.1 Typology of randomized experiments and observational studies.
7.1 Three standards of ethics and experiments.
10.1 Effect of exposure to a friend's tweet.
12.1 Partial counterbalancing in within-subject experimental designs.
15.1 Coverage of 90% uncertainty intervals by subgroup.
15.2 Subgroup effects from audit experiment.
15.3 Split-sample estimates from conjoint analysis.
16.1 Comparing Horvitz–Thompson to Hajek estimators using approximate exposure probabilities.
16.2 Misspecifying exposure conditions.
16.3 Comparing unit to cluster randomization.
18.1 Recommended information for preregistration and pre-analysis plans in experimental research.
18.2 Gerber et al.'s (2015) checklist of reporting items for experimental research.
21.1 Subgroup estimates within the experiment in the CDR example.
21.2 PATE estimation in the CDR example.
21.3 Simulation results using different adjustment sets for estimating the PATE.
21.4 Estimation with common population data types.
22.1 Trade-offs across the three types of multi-context approaches.
23.1 Racial and ethnic identity experiments.
28.1 Summary of experimental research on electoral accountability (2008–2015).
28.2 Summary of experimental research on electoral accountability (2016–2020).

Figures

1.1 APSR experimental articles by decade.
1.2 Experimental trends.
2.1 An example conjoint table from the Democratic primary experiment.
2.2 Outcome variables in the Democratic primary experiment.
2.3 Average marginal component effects of candidate attributes in the Democratic primary conjoint experiment (forced choice outcome).
2.4 Average marginal component effects of candidate attributes in the Democratic primary conjoint experiment (rating outcome).
2.5 Conditional average marginal component effects of candidate attributes across respondent party.
4.1 Comparing costs of different designs.
4.2 "The traditional design."
4.3 Applying the framework when placebo is not possible: mail example.
4.4 Example results: variable costs for studying public health intervention in Liberia.
5.1 Risk preference elicitation.
9.1 Use of various samples in political science journals.
10.1 Effect of exposure to Promoted Tweets, direct messages, or both on signing an online petition.
12.1 Sample "screener" or attention check question.
14.1 Graphical depiction of direct and indirect (through M) effects of A on Y.
14.2 Directed acyclic graph depicting the key identification criteria for the single experiment.
15.1 Subgroup estimates.
16.1 Causal graph illustrating interference mechanisms and confounding mechanisms when treatment (Z) is randomized.
16.2 Figure from Sandefur (2018) displaying long-term spillover effects from an unconditional cash transfer program, as reported in Haushofer and Shapiro (2018).
16.3 Example of an interference network with 10 units. Each edge (link) represents a possible channel through which spillover effects might transmit.
16.4 From left to right, the distribution of the point estimates for 3000 simulations when the exposure mapping ignores interference, assumes first-degree interference, and assumes second-degree interference, respectively.
16.5 The distribution of point estimates for 3000 simulations given different proportions of missing ties for a case of positive spillover.
16.6 The distribution of point estimates for 3000 simulations with partial interference specified at the level of groups only.
17.1 ATE simulated two-arm trial.
17.2 A simulated block-randomized experiment.
17.3 The same simulated block-randomized experiment.
17.4 A simulated cluster-randomized experiment.
17.5 Covariate adjustment.
17.6 Simulated experiment with a continuous pretreatment covariate.
17.7 Simulated experiment encountering two-sided noncompliance.
17.8 A simulated experiment encountering attrition.
21.1 Simplified example of a field experiment.
24.1 News attention to racial issues.
25.1 Number of women in American Congress over time.
26.1 Total number of gender articles and experimental gender articles in three leading general interest political science journals (American Political Science Review, American Journal of Political Science, and Journal of Politics).
26.2 Total number of gender articles and experimental gender articles in three leading comparative politics journals (World Politics, Comparative Political Studies, and Comparative Politics).
28.1 Risk vs. reward of corruption control experiments.
31.1 Hypothesized relationships structuring the political problem of climate change.

Boxes

8.1 A guide for scholars using experiments to study political institutions.
11.1 Steps in an organizational partnership.
11.2 Helpful relationship-building techniques.
11.3 What should be put in writing ahead of time?

Contributors

Georgia Anderson-Nilsson, Vanderbilt University
Peter M. Aronow, Yale University
Kirk Bansak, University of California, San Diego
Graeme Blair, University of California, Los Angeles
Cheryl Boudreau, University of California, Davis
David E. Broockman, University of California, Berkeley
Daniel M. Butler, University of California, San Diego
Amanda Clayton, Vanderbilt University
Alexander Coppock, Yale University
Charles Crabtree, Dartmouth College
James N. Druckman, Northwestern University
Catherine Eckel, Texas A&M University
Dean Eckles, Massachusetts Institute of Technology
Adam N. Glynn, Emory University
Donald P. Green, Columbia University
Christian R. Grose, University of Southern California
Andrew M. Guess, Princeton University
Jens Hainmueller, Stanford University
Erin Hartman, University of California, Los Angeles
Daniel J. Hopkins, University of Pennsylvania
Shanto Iyengar, Stanford University
Joshua L. Kalla, Yale University
Samara Klar, University of Arizona
Yanna Krupnikov, Stony Brook University
Paul Lagunes, Columbia University
Adam Seth Levine, Johns Hopkins University
Natalia Candelo Londono, Queens College, City University of New York
Neil Malhotra, Stanford University
Aila M. Matanock, University of California, Berkeley
Gwyneth McClendon, New York University
Mary C. McGrath, Northwestern University
Diana C. Mutz, University of Pennsylvania
H. Hannah Nam, Stony Brook University
Noah L. Nathan, University of Michigan
Jennifer Pan, Stanford University
Erik Peterson, Texas A&M University
Marc Ratkovic, Princeton University
Tyler Reny, Washington University in St. Louis
Cyrus Samii, New York University
Elizabeth Schmitt, University of Wisconsin–Platteville
Jason Seawright, Northwestern University
Brigitte Seim, University of North Carolina at Chapel Hill
Jasjeet S. Sekhon, Yale University
Amber D. Spry, Brandeis University
Hillary Style, Stony Brook University
Dawn Langan Teele, University of Pennsylvania
Rocío Titiunik, Princeton University
Ali A. Valenzuela, Princeton University
Lynn Vavreck, University of California, Los Angeles
Sean J. Westwood, Dartmouth College
Ariel White, Massachusetts Institute of Technology
Teppei Yamamoto, Massachusetts Institute of Technology
Stephanie Zonszein, New York University

Acknowledgments

In 2011, we coedited, along with Jim Kuklinski and Skip Lupia, the Cambridge Handbook of Experimental Political Science. The broad scope of that volume helped convince a skeptical discipline that experiments had arrived in political science. A decade later, experiments have quickly evolved from being an accepted method to being a primary method. The substantive, methodological, and epistemological advances are apparent in every subfield. This volume covers those advances. We are indebted first and foremost to the authors, who not only contributed superb essays, but also served as reviewers for one another. The quality of the chapters reflects both the authors’ command of their subject matter and the many constructive exchanges between the authors. This process of scholarly exchange reflects an extraordinary conference held at Northwestern University on May 21–22, 2019. We thank the generous sponsors of that conference: the National Science Foundation (SES-1822286), the Ford Motor Company Center for Global Citizenship at the Kellogg School of Management at Northwestern University

(directed by David Austen-Smith), and the Department of Political Science in the Weinberg College of Arts and Sciences at Northwestern University. We also thank the Ford Motor Company Center for Global Citizenship and the Institute for Policy Research at Northwestern University (directed by Diane Schanzenbach) for administrative support. We are especially appreciative of the help of Sheila Duran, Cynthia Kendall, Cindy Mydlach, and Patricia Reese. Adam Howat and Andrew Thompson – who were then advanced PhD students at Northwestern – also provided invaluable support. A number of graduate students who attended the conference generously provided comments on drafts of chapters; for that, we thank Robin Bayes, Amanda d'Urso, Daniel Encinas, Sam Gubitz, Katie Harvey, Adam Howat, Suji Kang, Bo Won Kim, Irene Kwon, Jeremy Levy, Ivonne Montes, Matt Nelsen, Jake Rothschild, Richard Shafranek, and Andrew Thompson. We also thank others who attended the conference and provided valuable feedback, including Tabitha Bonilla, Margaret Brower, Maria Carreri, Jean Clipperton, Dan Galvin, Jordan Gans-Morse, Laurel Harbridge-Yong, Lenka Hrbkova, John Lee, Reuel Rogers, and Edoardo Teso. Special thanks go to John Bullock for attending and serving as a discussant at the conference. We thank our superb editor at Cambridge, Robert Dreesen, not only for attending the conference, but also for

his constant support and advice throughout the process. Our hope is this volume spurs another decade, if not more, of important advances in experimental political science. – James N. Druckman and Donald P. Green

CHAPTER 1

A New Era of Experimental Political Science∗

James N. Druckman and Donald P. Green

Abstract

Experimental political science has transformed in the last decade. The use of experiments has dramatically increased throughout the discipline, and technological and sociological changes have altered how political scientists use experiments. We chart the transformation of experiments and discuss new challenges that experimentalists face. We then outline how the contributions to this volume will help scholars and practitioners conduct high-quality experiments.

* We thank Nicolette Alayon, Robin Bayes, Jeremy Levy, Jacob Rothschild, and Andrew Thompson for research assistance. We thank Lynn Vavreck for excellent advice.

Experimental political science has changed. In two short decades, it evolved from an emergent method to an accepted method to a primary method. We are now entering a new era of experimental political science – what can be called "experimental political science 2.0." We do not use the term "era" lightly. The new era reflects, in part, the expanded use of experiments throughout the discipline. But, more fundamentally, it reflects a radical shift in how social scientists design, analyze, and interpret experiments. For most of social science history, the challenges for experimentalists concerned obtaining data beyond student subject pools and what to do with null results that typically landed in the "file drawer." This is no longer true. Data are plentiful, thanks to Internet panels, crowdsourcing platforms, social media, and electronic access to elites; partnerships between researchers and nonacademic entities have also become prolific sources of experimental data. Computing advances have made the implementation and analysis of large-scale studies routine. Moreover, scholars now regularly discuss how to address issues of publication bias, replication, and data-sharing so as to ensure the production of credible experimental research. The challenge now is to ensure that experimentalists design sound studies and


implement them in ways that illuminate cause and effect. They must do so while also respecting ethical boundaries, interpreting results in a transparent manner, and sharing data and research materials to ensure others can build on what has been learned. Political science experimentalists, moreover, can capitalize on the widespread acceptance of the method, novel data sources, and evolving epistemological orientations. Making the most of these opportunities requires carefully choosing an appropriate design for a given research question, developing theoretically informative treatments and valid outcome measures, choosing a suitable setting, engaging in sound analyses, cautiously generalizing, and addressing enduring debates. The goal of this volume is to shed light on best practices. In what follows, we first describe the evolution of experiments in political science, focusing on quantitative trends, substantive reach, and institutional progression. This discussion documents a transformation in how political scientists think about and conduct experiments. We then turn to a discussion of recent developments in the social sciences involving technological change and open science, an era we characterize as experimental political science 2.0. This new era is defined by: the application of new designs; the introduction of novel data sources, measurement approaches, and statistical methods; the use of experiments in more areas; and discipline-wide discussions about the robustness, generalizability, and ethics of experiments in political science. This volume explores these new opportunities while also highlighting the concomitant challenges. The goal is to help scholars and practitioners conduct high-quality experiments that make important contributions to knowledge.

1.1 The Evolution of Experiments in Political Science

One way to document the evolution of experiments in political science is by counting the number of such articles in general political science journals. We do that by focusing on the discipline's flagship journal, the American Political Science Review (APSR). We identified all published articles containing experiments from the launch of the journal in 1906 through any such articles posted online in May 2019.1 The first experiment appeared in 1956, 50 years after the journal launched. In Figure 1.1, we plot the number of articles by decade, starting in the 1950s. To be clear, this is not a cumulative count of articles, but rather the specific number by decade. For example, from 2000 through 2009, 31 articles in the APSR used an experimental approach; this number jumped to 75 in the most recent decade. Figure 1.1 supports the claim that experiments moved from being a marginalized method to an accepted method to a central method.

[Figure 1.1: APSR experimental articles by decade. 1950–1959: 1; 1960–1969: 3; 1970–1979: 7; 1980–1989: 13; 1990–1999: 21; 2000–2009: 31; 2010–2019: 75.]

Has the recent surge in experimental articles spanned subfields in political science? In 2006, Druckman et al. (p. 627) observed, "To date, the range of application remains narrow, with most experiments pertaining to questions in the subfields of political psychology, electoral politics, and legislative politics. An important question is the extent to which

1 In so doing, we extend the timeline from our prior work (Druckman et al. 2006; also see Rogowski 2015). So as to accommodate how political scientists from varying perspectives define “experiment,” we counted an experiment as a study involving random assignment to conditions or entailing an economic game that applies induced value theory. That said, we assert that “experiment” should only be used when the study employs random assignment (contrary to usage in many economic game studies; see, e.g., Green and Tusicisny 2012).


experiments or experiment-inspired research designs can benefit other subfields." The last decade has answered that question decisively: experiments have become common throughout the discipline. For example, in international relations, there now exists a sizeable experimental literature on "audience costs," which refers to a process whereby governments publicly threaten to use force to induce a change in opposing countries' actions. The public nature of such a threat makes it credible, since the opponent recognizes a failure to use force would lead to domestic backlash (e.g., at the voting booth). Experiments show that, indeed, citizens have a distaste for empty threats (e.g., Tomz 2007; although see Kertzer and Brutger 2016). The emergence of experimental research has also been apparent in other international relations domains, such as election monitoring, which has seen dramatic growth in the number and sophistication of randomized evaluations (Buzin et al. 2016; Hyde and Marinov 2014; Ichino and Schündeln 2012). This momentum is especially noteworthy in comparative politics; since 2010, 45% of the experimental articles published in the APSR can be classified in the field of comparative politics (up from 19% during 2000–09 and 2% during 1956–99). Some of these articles fall at the intersection of comparative politics and international relations, as in Beath et al.'s (2013) study of a massive aid program designed to empower Afghan women within the context of a civil war against the Taliban. Others span comparative politics and political psychology, as in Scacco and Warren's (2018) study of attempts to reduce prejudice between Muslims and Christians in Nigeria. Arguably the largest literature focuses on governance and accountability (see Dunning et al. 2019), typified by studies (e.g., Grossman and Michelitch 2018) that provide voters with job performance scorecards for randomly selected public officials over a series of election cycles. A final example of the reach of experiments concerns studies of whether and how public officials respond to queries from their constituents. In 2011, Butler and Broockman published their correspondence


study of state legislators in 44 states. They sent email requests for information about voting registration, varying whether the email came from an ostensibly African-American or White constituent who was a Democrat, Republican, or did not mention a party. The binary outcome measure was whether the sender received a reply from the state legislator's office. This study, which was patterned after correspondence experiments on job market discrimination (Bertrand and Mullainathan 2004; Pager 2003), spawned a literature that, by 2017, included more than 50 audit experiments on the responsiveness of public officials (Costa 2017). It is also part of a growing experimental focus on elites – public officials or political leaders – as subjects (e.g., Grose 2014). It is clear that political scientists think about and apply experiments in a very different way than a decade ago: they think of experimentation as a primary methodology and apply it in novel domains. These trends have both reflected and spurred various institutional innovations. Here, we point to three. First, in 2001, Time-sharing Experiments for the Social Sciences (TESS) was established with support from the National Science Foundation. TESS capitalizes on economies of scale to enable scholars from across the social sciences, on a competitive basis, to conduct survey experiments on probability-based samples of the US population (see Mutz 2011). Since its founding, TESS has supported more than 400 experiments. Many of them are published in disciplinary flagship journals, as well as Science and the Proceedings of the National Academy of Sciences of the United States of America. TESS also makes raw data from all experiments publicly available, regardless of whether the results are published. The genesis of TESS in 2001 followed on the heels of what could be called a revolution in political science field experiments in 2000. In that year, a field experiment on voter mobilization was published in the APSR (Gerber and Green 2000). This publication was notable since it was the 47th experimental article in the journal, but only the third field experiment, and the first field


experiment in nearly 20 years.2 This paper sparked burgeoning literatures on voter mobilization (e.g., Nickerson 2008) and vote choice (Wantchekon 2003); more generally, it ushered in the use of field experiments in other subfields (e.g., Findley et al. 2014; Hyde and Marinov 2014).3 The discipline established two other notable institutions about a decade later. In 2009, Evidence in Governance and Politics (EGAP) formed as a network for those engaged in field experiments on governance, politics, and institutions. EGAP played an important role in developing and advocating methodological practices such as preregistration of experiments and professional standards concerning the public disclosure of results. As it grew in membership and capacity, it also expanded its worldwide outreach efforts to include instruction on experimental methods across the Global South. In 2010, the first meeting of the American Political Science Association’s section on Experimental Research took place, and a year later it voted to launch the Journal of Experimental Political Science (the first issue of which appeared in 2014). These institutional innovations, too, were tracked by some notable publications. This list includes the explosion of experimental articles using Amazon’s Mechanical Turk to furnish research participants (Berinsky et al. 2012; Mullinix et al. 2015) and, in 2011, the predecessor to this book, the Cambridge Handbook of Experimental Political Science (Druckman et al. 2011).4

2 We do not count Gosnell (1926), since he did not seem to employ random assignment.
3 Since 2000, nearly 30 field experiments have been published in the APSR, and the Annual Review of Political Science has published several experiment-focused reviews on a range of topics, including collective action (de Rooij et al. 2009), developmental economics (Humphreys et al. 2009), political institutions (Grose 2014), and international relations (Hyde 2015).
4 Examples of other institutional developments include the launching of subject pools in more than a dozen political science departments (Druckman et al. 2018, p. 624) and a Routledge book series focused on experimental political science, Routledge Studies in Experimental Political Science.

These trends make clear that experiments now occupy a central place in political science. For reasons to which we turn next, the ways in which researchers design, analyze, and present experiments are rapidly changing, leading to new challenges and opportunities.

1.2 Technological Change and Open Science

The initial rise of experiments followed on the heels of several technological advances. In the 1980s, the advent of computer-assisted telephone interviewing facilitated the implementation of phone-based survey experiments (Sniderman and Douglas 1996). The pace of technological change has, if anything, accelerated in recent years. The costs and logistical challenges of data collection have dramatically dropped (e.g., Groves 2011), enabling researchers to access survey and behavioral data at a notably larger scale (e.g., Kramer et al. 2016). Consider four dynamics. First, as intimated above, data are now much cheaper and easier to obtain than ever before, thanks to the Internet and the emergence of crowdsourcing platforms and commercial Internet survey panels. These data are then easier to share due to the growing use of public data repositories, such as Dataverse and GitHub. The abundance of public data allowed, for example, Coppock (2019) to use 27 studies to show that individual attributes such as age, gender, race, and ideology do not consistently condition how individuals process political messages: the effects of many messages do not vary across subgroups, implying that we can generalize about the impact of isolated experiments to large segments of the population. Second, social media offer researchers access to behavioral data and the opportunity to intervene experimentally (e.g., Kramer et al. 2016), sometimes with literally millions of participants. Bond et al. (2012) conducted an experiment by delivering political mobilization messages to 61 million Facebook users, testing whether an "I Voted" widget


that announced one’s election participation to others increased turnout among Facebook users and their friends (see also Jones et al. 2017). Third, the advent of portable computers with high-resolution screens has made it easy for researchers to deploy surveys and lab-like treatments in field settings, which dramatically lowers logistical costs. For instance, Kim (2018) used a truck equipped with mobile television monitors, tablet computers, and chairs to conduct a lab-in-the-field study in three counties in rural Pennsylvania. The experiment shows that exposure to entertainment television with “rags-to-riches” narratives increases individuals’ belief in the American Dream, particularly for Republicans (also see Busby 2018). Fourth, advances in computing allow researchers to analyze high-dimensional data, which is to say data with large numbers of predictors or measurements. Computational requirements are especially demanding for algorithms that look for network effects (e.g., Grimmer et al. 2017). The same may be said for the rapidly growing list of techniques designed to automate the detection of treatment effect heterogeneity among subgroups in field experiments (Imai and Ratkovic 2013; Imai and Strauss 2011) and survey experiments (Green and Kern 2012). In the latter case, the authors revisit a large experimental literature based on General Social Surveys that have for decades asked national samples of Americans about their preferences regarding government spending. In the domain of social spending, question wording is varied randomly, and some respondents are asked about spending on “aid to the poor” while others are asked about spending on “welfare.” These surveys consistently show “aid to the poor” to be much more popular than “welfare,” but the question is: What sorts of respondents are especially susceptible to this effect? Rather than manually search for treatment-by-covariate interactions with education, party, ideology, and a slew of other background attributes, the authors use machine learning methods to conduct an


automated search that not only detects significant interactions, but also cross-validates the results using respondents who were randomly excluded from the initial round of exploration. Apart from technological advances, the social sciences have become increasingly attuned to challenges of accumulating knowledge given perverse incentives to exaggerate the size and statistical significance of treatment effects or, conversely, to bury weak or counterintuitive findings. The tendency for journals to publish splashy, statistically significant findings is often termed “publication bias” (Brown et al. 2017). Evidence of this bias in many disciplines is not new, but political scientists have only recently begun to document it (e.g., Gerber et al. 2010). In one notable example, Franco et al. (2014) show that of 221 experimental surveys, strong results are 40 percentage points more likely to be published than null results and 50 percentage points more likely to be written up. This is clear evidence of a publication bias at the writing and submission stages (also see Franco et al. 2016). One response to publication bias has been a call for more replications: emulating the extant study’s procedures, but with new data (“repeatability,” as described in Freese and Peterson 2017). Massive replication efforts have had mixed results, with the most widely discussed being the Open Science Collaboration’s (2015) effort in which more than 250 scholars attempted to replicate 100 experiments in three highly ranked psychology journals from 2008. They reported that “39% of effects were subjectively rated to have replicated the original result” (Open Science Collaboration 2015, p. 943). This finding has led some to sound alarm bells of a replication crisis (Baker 2016); however, the extent of this crisis continues to be debated (e.g., Fanelli 2018; Van Bavel et al. 2016), as other replication attempts, including those in political science, have had more success (e.g., Camerer et al. 2016; Coppock 2019; Mullinix et al. 2015). These replication attempts are possible in part because of a push for scholars to make their procedures, stimuli, surveys, and


data publicly available. In political science, most general and experimentally oriented journals require data access upon publication (Lupia and Elman 2014). Growing public access to data is of enormous value to instructors and meta-analysts, but also facilitates novel research. An example is Zigerell's (2018, p. 1) reanalysis of 17 studies on racial discrimination (e.g., attitudes towards White or Black political candidates or job applicants). He reports for "White participants . . . pooled results did not detect a net discrimination for or against White targets, but, for Black participants . . . pooled results indicated the presence of a small-to-moderate net discrimination in favor of Black targets." The opportunities that come from data sharing, replication debates, and related discussions have invigorated a call for "open science." Nosek et al. (2015) identify standards of transparency and openness involving: citation standards; transparency of data, material, and analyses; preregistration of studies and analysis plans; and encouragement of replication studies. Interestingly, this move towards transparency has also generated some questions about respondent privacy as well as concerns about how respondents themselves react upon learning of data openness (Connors et al. 2019). In sum, fundamental technological and sociological changes have transformed the social sciences. The result, which coincided with the emergence of experiments as a primary method in political science, is what we call experimental political science 2.0.
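To make the split-sample logic behind the automated searches for treatment effect heterogeneity described earlier in this section concrete, the sketch below simulates a question-wording experiment ("welfare" versus "aid to the poor"), screens a few candidate moderators in one random half of the respondents, and then re-tests any apparent interaction in the held-out half. This is only an illustrative sketch in Python, not the Bayesian additive regression tree procedure used by Green and Kern (2012); the sample size, covariates, and effect sizes are all invented for the example.

import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Randomly assigned question wording: 1 = "welfare", 0 = "aid to the poor" (simulated).
treat = rng.integers(0, 2, n)

# Candidate moderators; by construction, only "conservative" moderates the effect.
covariates = {
    "conservative": rng.integers(0, 2, n),
    "college": rng.integers(0, 2, n),
    "female": rng.integers(0, 2, n),
}

# Simulated support for more spending: the "welfare" wording lowers support,
# and does so more sharply for the simulated conservatives.
latent = 0.6 - 0.25 * treat - 0.15 * treat * covariates["conservative"] + rng.normal(0, 0.3, n)
support = (latent > 0.3).astype(float)

def interaction_z(y, t, x):
    # z-statistic for the difference between the treatment effect when x == 1 and when x == 0.
    def effect(mask):
        diff = y[mask & (t == 1)].mean() - y[mask & (t == 0)].mean()
        var = (y[mask & (t == 1)].var(ddof=1) / (mask & (t == 1)).sum()
               + y[mask & (t == 0)].var(ddof=1) / (mask & (t == 0)).sum())
        return diff, var
    d1, v1 = effect(x == 1)
    d0, v0 = effect(x == 0)
    return (d1 - d0) / np.sqrt(v1 + v0)

explore = rng.random(n) < 0.5   # random half used to hunt for interactions
confirm = ~explore              # held-out half used only for confirmation

for name, x in covariates.items():
    z_explore = interaction_z(support[explore], treat[explore], x[explore])
    if abs(z_explore) > 1.96:   # flagged as a possible moderator in the exploration half
        z_confirm = interaction_z(support[confirm], treat[confirm], x[confirm])
        print(f"{name}: exploration z = {z_explore:.2f}, confirmation z = {z_confirm:.2f}")

Because the confirmation half plays no role in the search, a real interaction should reappear there, while chance "discoveries" in the exploration half typically will not.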

1.3 Experimental Political Science 2.0

Experimental political science 2.0 is characterized by: (1) the introduction of previously underutilized designs; (2) the explosion of new data sources; (3) the use of new measurement techniques; (4) advancements in statistical methods; (5) increased discussion about robustness and generalizability; and (6) applications to novel areas of study. To get some sense of these trends, we analyzed the content of all experimental articles in the APSR that

made up Figure 1.1.5 In reporting the results, we first distinguish three time periods: all articles prior to 2000 constitute the lead-up to the experimental era; 2001–2009 make up the first generation of widespread use; and 2010–present is what we call experimental political science 2.0. These cutoffs roughly coincide with the aforementioned institutional developments (e.g., TESS, EGAP, the American Political Science Association's Experimental Research section). Our interest is in the use and emergence of new approaches, and the statistics we present are the percentages of experimental articles in each era that used a given approach.

We start with what we might call "nontraditional designs," insofar as they are designs that received little application in early experiments in political science. We discuss them in more detail below, but they include conjoint surveys, audits, field experiments with surveys, lab-in-the-field studies, and natural experiments. In Figure 1.2, we report the percentages with which each of these designs were used out of all APSR experiments published in a time period. For instance, before 2000, of the 45 experiments published, 4% used one of the aforementioned designs. This number jumped to 13% in the second period and 32% in the most recent period – a clear trend towards increased application. We see a similar upward trend when we look at the proportion of studies that use what we might call "nontraditional subject pools," including data from nonstudent convenience samples (e.g., crowdsourcing platforms), social media, or elites (e.g., legislators). The use of these subject pools jumped by 11 percentage points in the current era relative to the one that preceded it (52%–63%). Another change that came about largely with the rise of field experiments after 2000 was collaboration with organizational partners (e.g., nonprofits). Figure 1.2 shows that such collaborations increased starting in 2000 but remain fairly minimal, perhaps due to a lack of guidance on how to develop such partnerships (a topic we take up in this volume). Another important issue that undoubtedly will be addressed more frequently in the future is discussion of ethics. We identified only one experimental article in the APSR that included an explicit discussion of ethics in the main text of the paper (Paluck and Green 2009); growing recognition of ethical dilemmas in social science research (e.g., Teele 2014) will undoubtedly generate increased interest among both authors and audiences for further discussion of ethical issues.

5 We thank Robin Bayes and Andrew Thompson for conducting the content analysis.

[Figure 1.2: Experimental trends. Percentage of APSR experimental articles in each era (1956–1999, 2000–2009, 2010–2019) that used nontraditional designs (conjoint, audit, field + survey, lab-in-the-field, natural), nontraditional samples (nonstudent convenience, social media, elite), an organizational partner, or an explicit ethics discussion.]

In addition to these trends of design and data, the field continues to evolve when it comes to measurement and statistical methods. As in much of the social sciences, political scientists have embraced new measurement techniques and sources, such as administrative records, social media behaviors, physiological measures,

and relatively unobtrusive measures of psychological processes. As for statistical methods, recent decades have seen growing sophistication in the use of techniques for detecting heterogeneous treatment effects (e.g., Grimmer et al. 2017; Ratkovic and Tingley 2017), spillovers between units (Aronow 2012; Bowers et al. 2016), and causal mechanisms (e.g., Acharya et al. 2016; Imai and Yamamoto 2013). These methods feature in just over 30% of the articles appearing during the earliest time period and have become much more commonplace since 2000 (roughly 50% of experiments). A distinct trend worth noting concerns the use of visuals – nearly all experimental articles used visuals in the last decade, up from just more than half in the preceding period. Another feature of experimental political science 2.0 echoes the aforementioned open science movement’s concern with robustness and generalizability. This approach


involves sustained discussion about reporting standards: one of the first actions of the American Political Science Association’s Experimental Research section was to form a reporting committee (e.g., Gerber et al. 2014, 2015; Mutz and Pemantle 2015). At roughly the same time, the data access and research transparency (DA-RT) movement in political science gained prominence. It arose from growing concerns about scholars’ failure to replicate a considerable number of empirical claims being made in top journals – often as a result of researchers’ inability or unwillingness to provide information about how they drew conclusions from their data or to make the data available to others (Lupia and Elman 2014). The initiatives require authors, including experimentalists, to provide data access, production transparency (e.g., procedures about how the data were collected), and analytic transparency (American Political Science Association 2012, pp. 9–10). There also are ongoing debates in the discipline about the need to register experiments so that researchers who later summarize literatures can see the extent to which research results went unreported. Another debate concerns preregistration of analysis plans, an initiative designed to limit researcher discretion and to clarify which analytic decisions were made in advance of seeing the data and which grew out of data exploration (Monogan 2015). Judging from public websites that record the use of preregistration and pre-analysis plans, their use has grown dramatically, and there seems to be an emerging norm among experimental researchers that best practices involve submitting these documents. A distinct but related development concerns increased discussion of how to generalize from experiments. Generalization is fundamentally a theoretical issue, but one that draws on empirical insights gleaned from the study of heterogeneous treatment effects across subjects, treatments, contexts, and outcomes. One way to advance this agenda is to conduct experiments in multiple contexts, as exemplified by EGAP’s Metaketa Initiative that “funds and coordinates studies across countries, clustered by theme, to improve

and incentivize innovative research alongside integrated analysis and publication" (https://egap.org/our-work/the-metaketa-initiative/). This is an exciting advance given that, to date, multicountry experiments are rare; our content analysis found only 6% of experiments included multiple countries in 2000–2009 and just 5% in the most recent decade. Of course, conducting experiments across countries requires careful thought about the comparability of measures across contexts; the qualitative data gathering that is used to validate and refine measurement reflects the disciplinary trend towards multimethods research (e.g., Seawright 2016). The final feature of experimental political science 2.0 is the application of the method to novel areas that historically have not used randomized controlled trials. As will be highlighted in the volume, this includes topics such as bureaucracy, corruption, and censorship – areas that can now be studied experimentally thanks to the aforementioned innovations in design, data access, and analysis. We next turn to how this volume is structured so as to help scholars, students, and practitioners navigate experimental political science 2.0. Our goal is to help experimental political scientists thoughtfully design studies, analyze data, present results, and expand the application of experiments.

1.4 This Volume

We chose topics for the volume that are not only current, but also emergent. We hope to stay one step ahead of the curve. Perhaps most importantly, we opted for areas and authors that connect with one another – this book is not a jumble of standalone chapters. Common themes surface throughout, such as the importance of connecting theory to design, making design choices that maximize generalizable inference, and using experiments to extend the frontier of knowledge, which means exploring difficult and even dangerous topics. We organized the book into seven sections, but the chapters intersect both within and across sections. Each chapter includes an abstract, so instead of summarizing them


here, we highlight connections to provide readers with a roadmap of how the contributions relate to one another. The first section includes discussions of experimental designs that are (relatively) newly applied in political science. Conjoint studies – covered in a chapter by Bansak, Hainmueller, Hopkins, and Yamamoto – ask participants to make choices across multidimensional descriptions of people, policies, or issues; for instance, this approach may involve soliciting opinions about immigrants who vary in their country of origin, religion, age, education, language skills, etc. Audit experiments, covered by Butler and Crabtree, involve sending correspondence to public officials, randomly varying the nature of the messages, and testing whether the different messages elicit different responses. For example, does a legislator's propensity to respond to constituent mail depend on whether the author has a putative White or Black name? Both conjoint and audit designs allow political scientists to gauge difficult-to-isolate behaviors such as racial discrimination, gender biases, or illegal actions because respondents remain unaware of what is being assessed (e.g., they are not directly asked about prejudice or corrupt behavior). The rigor and breadth of these experimental designs explain why they also play a central role in other parts of the book that use experiments to illuminate hidden or corrupt activities and identity-based discrimination. Applications of conjoint and audit designs depend on context – such as the level of scrutiny of hidden actions or the nature of gender norms. Two other designs focus even more on context. In their chapter, Kalla, Broockman, and Sekhon present a design that combines survey and field experiments – by first surveying respondents, then employing an ostensibly unrelated field intervention, and then surveying them again. This approach, which has clear cost advantages over other designs, is particularly germane to situations where field interventions seek to change attitudes and beliefs. Additionally, lab-in-the-field studies – where the lab is constructed in a field setting – allow researchers to


study choices that reflect subjects' traits and strategic judgments. Eckel and Londono, in their chapter, detail several such examples, while also explaining best design practices. All four of these designs – audit, conjoint, field survey, and lab-in-the-field – constitute alternative approaches to measurement and causal inference across contexts. They also, in theory, could be combined – one could imagine a field survey study where the survey component includes a conjoint design. Stepping back from the details of specific designs, one may reflect on two larger issues. First, with one exception, experimental designs involve an intervention by the researcher. The exception is the so-called natural experiment, which has become popular in political science (e.g., Dunning 2012). But what counts as a natural experiment? What separates an experiment from a nonexperimental study that is said to involve an "as-if" random assignment? This question is taken up in the chapter by Titiunik. Her discussion clarifies what constitutes an actual experiment as opposed to a natural experiment and describes the advantages and disadvantages of each approach. Second, experimental interventions inherently involve ethical issues, since the researcher is changing the world in some way and, perhaps deceptively or unobtrusively, involving people in a research project. Teele's chapter offers a discussion of how to think about the ethics of consent in experiments. The second section of the book covers data sources that have become more widely used in the last decade. Each of these chapters connects directly to themes raised in the design section. For instance, the goal of many audit studies is to explore racial or ethnic discrimination by political elites. This aim requires using elite samples, a topic covered by Grose in his chapter. Grose also discusses other designs (e.g., natural experiments) that have been used to study the behaviors of those who govern. Apart from elite samples, perhaps the most notable development when it comes to data sources is, as mentioned, the use of crowdsourcing platforms and nonprobability Internet panels. These sources


offer many research opportunities, but how to assess the impacts of these distinct samples is not always clear – this topic is addressed in the chapter by Krupnikov, Nam, and Style. Another recent data source comes from social media, which offer experimentalists opportunities for new samples and behavioral measures, as well as a context within which to study social relationships. Guess's contribution provides one of the first overviews of this emerging experimental literature. Finally, the aforementioned explosion of field experiments of varying types (e.g., lab-in-the-field, field survey) presents challenges to data collection with targeted populations. Partnering with organizations often can facilitate experimentation, but there is currently no "how to" guide for developing and sustaining collaborations. Levine offers this guidance in his chapter. Even if one does not anticipate using one of the data sources covered in this section, the reading is obligatory for anyone who wants to understand why a research program opts for a particular source of data. The third section of the book contains just two chapters but touches on issues fundamental to nearly all experiments: once a research question is formulated, treatments and measures must be developed, which in turn presents questions of validity and generalizability. Perhaps ironically, given the rise of experiments in the discipline, there exists limited guidance on how to develop and deploy treatments. Mutz's chapter fills this gap, emphasizing the need to connect treatments to theory. For instance, if a lab-in-the-field study aims to explore the impact of emotion, the treatment needs to trigger emotion, even if it does so in a way that does not resemble a stimulus in the "real world." Mutz stresses the importance of empirical verification that the intervention produces the intended change (e.g., in emotion) with no other unintended changes. This requirement involves delicate questions of the measurement and conceptualization of the theoretically specified treatment. As Mutz explains in her chapter, most work to date has not engaged in sufficient empirical verification. In their chapter,

Peterson, Westwood, and Iyengar also discuss ways to enhance treatments and measures, particularly in the context of survey experiments. A long standing problem with many survey experiments concerns the use of vignettes that sometimes convey information beyond what the researcher intended (e.g., Dafoe et al. 2018); another problem is social desirability bias, which occurs when research participants confect responses that they hope will please the interviewer. These authors provide advice on how to develop more valid treatments and outcome measures. This advice is of particular importance for experimentation because the objective measures they discuss facilitate symmetric comparisons across treatment and control groups, which are crucial for unbiased inference. The fourth section turns to long standing methodological issues and recent advances in addressing them. One such challenge is understanding the causal mechanisms by which an experimental intervention influences an outcome. In his chapter, Glynn starts by pointing out the formidable design and analysis challenges that arise when researchers attempt to isolate causal mechanisms; his review covers recent technical developments and their implications for applied research. Another burgeoning literature considers the challenges of drawing reliable inferences about which types of subjects are most responsive to treatment. Ratkovic’s review of this literature calls attention to the growing role that machine learning methods are playing in the discovery and validation of subgroup differences in responsiveness to treatment. In their chapter, Aronow, Eckles, Samii, and Zonszein address an assumption that is typically invoked in experimental analysis: namely, that subjects respond exclusively to their own treatment assignment and no one else’s. The chapter considers what happens when this assumption is relaxed and effects are transmitted across space or via a social network. The chapter’s more advanced material reviews the ways in which experimental researchers across the social sciences have come to design and analyze


experiments to detect spillovers of various types. The recurrent theme of analyzing data in ways that reflect the underlying experimental design culminates in Coppock’s chapter on visual presentation, which offers a series of presentation principles to guide experimental researchers. We are grateful to the publisher for printing Coppock’s chapter in color and hosting online the open-source code for his examples, so that readers can make the most of this work. The volume’s fifth section turns to foundational social science issues on how to conduct experimental political science research in a transparent, credible, and generalizable fashion. All of the chapters in this section are of relevance to social scientists who hope to use experiments going forward, regardless of design, sample, measurement, or method. The chapters by Boudreau and Malhotra assess the role of transparency and publication bias in experiments, respectively. A chapter by Seawright describes the benefits of taking a multi-method approach to experimentation. This chapter amplifies and illustrates themes from previous chapters: how to develop valid treatments, measure outcomes accurately, and detect spillover effects. Two chapters grapple with the issue of generalization. Much of the history of experimental political science has focused on the value of clear causal inference, but the newest generation of work asks for more – it wants to make broader statements that carry across samples and contexts. Hartman provides a discussion of the design assumptions that must be made to warrant generalization and discusses methods that attempt to meet these requirements. Blair and McClendon offer a framework for how communities of experimental researchers can learn from studies conducted in multiple contexts. They also explain how designs in particular contexts (e.g., countries) can be employed when the goal is to transport and generalize inferences about cause and effect. These kinds of ambitious designs are becoming increasingly common across subfields. Finally, we include two sections on substantive areas that are of special prominence and tied to the methodological issues


discussed in the other sections of the book. The first explores topics related to ethnic identity, racial identity, and gender. These are not new topics, but they have attracted increasing attention from experimental researchers across the globe. In her chapter, Spry introduces readers to experiments on identity. Her discussion of measurement calls attention to promising approaches that allow respondents to express multiple ethnic identities and differentiate between demographic categories and identification with those categories. Valenzuela and Reny’s chapter takes on the topic of ethnic and racial priming; while much has been learned on this topic, the authors point out that researchers have only begun to consider the range of priming effects and the contexts in which they occur. Klar and Schmitt, in their contribution, also discuss how political changes – in their case with regard to women in office and gender stereotypes – have affected the design of experiments on gender in elections. These authors engage an old literature – going back forty years – and highlight some long standing challenges of design and measurement. In their chapter, Clayton and Anderson-Nilsson review gender experiments in a comparative context, noting the empirical and theoretical challenges of explaining whether and when results generalize across settings. Addressing this question is difficult, and the authors discuss a host of design challenges, including ethical ones. The last section of the book continues the theme of applying experiments to complex topics that have only recently featured active experimentation. The authors discuss design and data obstacles, robust findings and gaps, and theoretical implications. Nathan and White’s chapter on experiments on street-level bureaucrats (e.g., social service administrators, election officials, police officers) complements earlier chapters on audit experiments and experiments involving elites. Their chapter instructs scholars on how to design studies to address a host of challenges involving statistical power, the potential for spoiling the sample pool, spillover between subjects, and ethical


constraints. Lagunes and Seim’s chapter takes up a related and similarly nettlesome topic for experimenters: corruption and corruption control. Corruption by its very nature is designed to elude detection, which makes social science measurement difficult and sometimes dangerous. Nonetheless, the authors offer a way forward that sheds light on micro-motives and institutional mechanisms to control corruption. Pan’s chapter looks at distinct governmental activities that are meant to be hidden, such as censorship and repression. Validity and ethical questions abound in this area, and Pan lays these out in a systematic manner, highlighting connections with other chapters, such as Butler and Crabtree’s, Nathan and White’s, and Lagunes and Seim’s. In her chapter, Matanock considers the challenges of using experiments to understand postconflict contexts. Addressing the vast literature on peace stabilization and peace consolidation, she highlights the role of experiments in understanding enduring peace. McGrath’s chapter on climate change highlights a multilayered global problem that involves citizens’ opinions and behaviors, policies, and international collaboration. Experiments are perhaps the most promising method for disentangling the causal processes that may help address one of the most pressing global challenges. The book concludes with reflections from Lynn Vavreck. She details the evolution of the field from narrow interventions to complex and ambitious experiments designed to elaborate theories. The result is that experiments now form a central part of the science of studying politics.

1.5 Conclusion

Political science has come a long way since A. Lawrence Lowell’s 1909 presidential address to the American Political Science Association, when he notably stated, “We are limited by the impossibility of experiment. Politics is an observational, not an experimental science …” (Lowell 1910, p. 7). The last decade has made clear that experiments

are in fact possible in virtually all areas of the discipline. The question no longer is whether one can use experiments, but rather how to use them thoughtfully to shed light on political phenomena of theoretical and practical interest. This volume aims to ensure that experimentalists employ the method in ways that provide for the optimal accumulation of knowledge.

References Acharya, Avidit, Matthew Blackwell, and Maya Sen. 2016. “Explaining Causal Findings without Bias: Detecting and Assessing Direct Effects.” American Political Science Review 110(3): 1–18. American Political Science Association. 2012. A Guide to Professional Ethics in Political Science, 2nd ed. Washington, DC: American Political Science Association. Aronow, Peter M. 2012. “A General Method for Detecting Interference between Units in Randomized Experiments.” Sociological Methods & Research 41(1): 3–16. Baker, Monya. 2016. “Is There a Reproducibility Crisis?” Nature 533(7604): 452–54. Beath, Andrew, Fotini Christia, and Ruben Enikolopov. 2013. “Empowering Women through Development Aid: Evidence from a Field Experiment in Afghanistan.” American Political Science Review 107(3): 540–557. Berinsky, Adam J., Gregory A. Huber, and Gabriel S. Lenz. 2012. “Evaluating Online Labor Markets for Experimental Research: Amazon.com’s Mechanical Turk.” Political Analysis 20(3): 351–368. Bertrand, Marianne, and Sendhil Mullainathan. 2004. “Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination.” American Economic Review 94(4): 991–1013. Bond, Robert M., Christopher J. Fariss, Jason J. Jones, Adam D. I. Kramer, Cameron Marlow, Jaime E. Settle, and James H. Fowler. 2012. “A 61-Million-Person Experiment in Social Influence and Political Mobilization.” Nature 489: 295–298. Bowers, Jake, Mark M. Fredrickson, and Peter M. Aronow. 2016. “Research Note: A More Powerful Test Statistic for Reasoning about Interference between Units.” Political Analysis 24(3): 395–403.

Brown, Andrew W., Tapan S. Mehta, and David B. Allison. 2017. “Publication Bias in Science.” In The Oxford Handbook of the Science of Science Communication, eds. Kathleen Hall Jamieson, Dan M. Kahan, and Dietram A. Scheufele. New York: Oxford University Press, pp. 93–102. Busby, Ethan C. 2018. It’s All about Who You Meet: The Political Consequences of Intergroup Experiences with Strangers. PhD dissertation, Northwestern University. Butler, Daniel M., and David E. Broockman. 2011. “Do Politicians Racially Discriminate against Constituents? A Field Experiment on State Legislators.” American Journal of Political Science 55(3): 463–477. Buzin, Andrei, Kevin Brondum, and Graeme Robertson. 2016. “Election Observer Effects: A Field Experiment in the Russian Duma Election of 2011.” Electoral Studies 44: 184–191. Camerer, Colin F., Anna Dreber, Eskil Forsell, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, et al. 2016. “Evaluating Replicability of Labor Experiments in Economics.” Science 351(6280): 1433–1436. Connors, Elizabeth C., Yanna Krupnikov, and John Barry Ryan. 2019. “How Transparency Affects Survey Responses.” Public Opinion Quarterly 83(1): 185–209. Coppock, Alexander. 2019. “Generalizing from Survey Experiments Conducted on Mechanical Turk: A Replication Approach.” Political Science Research and Methods 7(3): 613–628. Costa, Mia. 2017. “How Responsive are Political Elites? A Meta-Analysis of Experiments on Public Officials.” Journal of Experimental Political Science 4(3): 241–254. Dafoe, Allan, Baobao Zhang, and Devin Caughey. 2018. “Information Equivalence in Survey Experiments.” Political Analysis 26(4): 399–416. De Rooij, Eline A., Donald P. Green, and Alan S. Gerber. 2009. “Field Experiments on Political Behavior and Collective Action.” Annual Review of Political Science 12: 389–395. Druckman, James N., Adam J. Howat, and Kevin J. Mullinix. 2018. “Graduate Advising in Experimental Research Groups.” PS: Political Science & Politics 51(3): 620–624. Druckman, James N., Donald P. Green, James H. Kuklinski, and Arthur Lupia. 2006. “The Growth and Development of Experimental Research in Political Science.” American Political Science Review 100(4): 627–635. Druckman, James N., Donald P. Green, James H. Kuklinski, and Arthur Lupia, eds. 2011.


Cambridge Handbook of Experimental Political Science. New York: Cambridge University Press. Dunning, Thad. 2012. Natural Experiments in the Social Sciences: A Design-Based Approach. Strategies for Social Inquiry. New York: Cambridge University Press. Dunning, Thad, Guy Grossman, Macartan Humphreys, Susan D. Hyde, Craig McIntosh, and Gareth Nellis, eds. 2019. Information, Accountability, and Cumulative Learning: Lessons from Metaketa I. New York: Cambridge University Press. Fanelli, Daniele. 2018. “Is Science Really Facing a Reproducibility Crisis?” Proceedings of the National Academy of Sciences of the United States of America 115(11): 2628–2631. Findley, Michael G., Daniel L. Nielson, and Jason Campbell Sharman. 2014. Global Shell Games: Experiments in Transnational Relations, Crime, and Terrorism. New York: Cambridge University Press. Franco, Annie, Neil Malhotra, and Gabor Simonovits. 2014. “Publication Bias in Social Science: Unlocking the File Drawer.” Science 345(6203): 1502–1505. Franco, Annie, Neil Malhotra, and Gabor Simonovits. 2016. “Underreporting in Psychology Experiments from a Study Registry.” Social Psychology and Personality Science 7(1): 8–12. Freese, Jeremy, and David Peterson. 2017. “Replication in Social Science.” Annual Review of Sociology 43: 147–165. Gerber, Alan S., and Donald P. Green. 2000. “The Effects of Canvassing, Telephone Calls, and Direct Mail on Voter Turnout: A Field Experiment.” American Political Science Review 94(3): 653–663. Gerber, Alan, Kevin Arceneaux, Cheryl Boudreau, Conor Dowling, Sunshine Hillygus, Thomas Palfrey, Daniel R. Biggers, and David J. Hendry. 2014. “Reporting Guidelines for Experimental Research: A Report from the Experimental Research Section Standards Committee.” Journal of Experimental Political Science 1(1): 81–98. Gerber, Alan S., Kevin Arceneaux, Cheryl Boudreau, Conor Dowling, and Sunshine Hillygus. 2015. “Reporting Balance Tables, Response Rates and Manipulation Checks in Experimental Research: A Reply from the Committee that Prepared the Reporting Guidelines.” Journal of Experimental Political Science 2(2): 216–229.


Gerber, Alan S., Neil Malhotra, Connor M. Dowling, and David Doherty. 2010. “Publication Bias in Two Political Behavior Literature.” American Political Research 38(4): 591–613. Gosnell, Harold F. 1926. “An Experiment in the Stimulation of Voting.” American Political Science Review 20(4): 869–874. Green, Donald P., and Andrej Tusicisny. 2012. “Statistical Analysis of Results from Laboratory Studies in Experimental Economics: A Critique of Current Practices.” Paper presented at the North American Economic Science Association (ESA) Conference, Tucson, AZ. Green, Donald P., and Holger L. Kern. 2012. “Modeling Heterogeneous Treatment Effects in Survey Experiments with Bayesian Additive Regression Trees.” Public Opinion Quarterly 76(3): 491–511. Grimmer, Justin, Solomon Messing, and Sean J. Westwood. 2017. “Estimating Heterogeneous Treatment Effects and the Effects of Heterogeneous Treatments with Ensemble Methods.” Political Analysis 25(4): 413–434. Grose, Christian R. 2014. “Field Experimental Work on Political Institutions.” Annual Review of Political Science 17: 355–370. Grossman, Guy, and Kristin Michelitch. 2018. “Information Dissemination, Competitive Pressure, and Politician Performance between Elections: A Field Experiment in Uganda.” American Political Science Review 112(2): 280–301. Groves, Robert M. 2011. “Three Eras of Survey Research.” Public Opinion Quarterly 75(5): 861–871. Humphreys, Macartan, and Jeremy M. Weinstein. 2009. “Field Experiments and the Political Economy of Development.” Annual Review of Political Science 12: 367–378. Hyde, Susan D. 2015. “Experiments in International Relations: Lab, Survey, and Field.” Annual Review of Political Science 18: 403–424. Hyde, Susan D., and Nikolay Marinov. 2014. “Information and Self-Enforcing Democracy: The Role of International Election Observation.” International Organization 68(2): 329–359. Ichino, Nahomi, and Matthias Schündeln. 2012. “Deterring or Displacing Electoral Irregularities? Spillover Effects of Observers in a Randomized Field Experiment in Ghana.” The Journal of Politics 74(1): 292–307. Imai, Kosuke, and Marc Ratkovic. 2013. “Estimating Treatment Effect Heterogeneity in Ran-

domized Program Evaluation.” The Annals of Applied Statistics 7(1): 443–470. Imai, Kosuke, and Aaron Strauss. 2011. “Estimation of Heterogeneous Treatment Effects from Randomized Experiments, with Application to the Optimal Planning of the Get-Outthe-Vote Campaign.” Political Analysis 19(1): 1–19. Imai, Kosuke, and Teppei Yamamoto. 2013. “Identification and Sensitivity Analysis for Multiple Causal Mechanisms: Revisiting Evidence from Framing Experiments.” Political Analysis 21(2): 141–171. Jones, Jason J., Robert M. Bond., Dean Eckles, and James H. Fowler. 2017. “Social Influence and Political Mobilization: Further Evidence from a Randomized Experiment in the 2012 U.S. Presidential Election.” PLoS ONE 12(4): e0173851 Kertzer, Joshua D., and Ryan Brutger. 2016. “Decomposing Audience Costs: Bringing the Audience Back into Audience Cost Theory.” American Journal of Political Science 60(1): 234–249. Kim, Eunji. 2018. “Entertaining Beliefs in Economic Mobility.” Working Paper, University of Pennsylvania. Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” Proceedings of the National Academy of Sciences of the United States of America 111(24): 8788–8790. Lowell, A. Lawrence. 1910. “The Physiology of Politics.” American Political Science Review 4(1): 1–15. Lupia, Arthur, and Colin Elman. 2014. “Openness in Political Science: Data Access and Research Transparency.” PS: Political Science and Politics 47(1): 19–42. Monogan, James. E. 2015. “Research Preregistration in Political Science: The Case, Counterarguments, and a Response to Critiques.” PS: Political Science and Politics 48(3): 425–429. Mullinix, Kevin J., Thomas J. Leeper, James N. Druckman, and Jeremy Freese. 2015. “The Generalizability of Survey Experiments.” Journal of Experimental Political Science 2(2): 109–138. Mutz, Diana C. 2011. Population-Based Survey Experiments. Princeton, NJ: Princeton University Press. Mutz, Diana C., and Robin Pemantle. 2015. “Standards for Experimental Research: Encouraging a Better Understanding of Experimental

Methods.” Journal of Experimental Political Science 2(2): 192–215. Nickerson, David W. 2008. “Is Voting Contagious? Evidence from Two Field Experiments.” American Political Science Review 102(1): 49–57. Nosek, Brian A., George Alter, George C. Banks, Denny Borsboom, Sara D. Bowman, Steven J. Breckler, et al. 2015. “Promoting an Open Research Culture.” Science 348(6242): 1422–1425. Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” Science 349: aac4716. Pager, Devah. 2003. “The Mark of a Criminal Record.” American Journal of Sociology 108(5): 937–975. Paluck, Elizabeth Levy, and Donald P. Green. 2009. “Deference, Dissent and Dispute Resolution: An Experimental Intervention Using Mass Media to Change Norms and Behavior in Rwanda.” American Political Science Review 103(4): 622–644. Ratkovic, Marc, and Dustin Tingley. 2017. “Sparse Estimation and Uncertainty with Application to Subgroup Analysis.” Political Analysis 25(1): 1–40. Rogowski, Ronald. 2015. “The Rise of Experimentation in Political Science.” In Emerging Trends in the Social and Behavioral Sciences: An Interdisciplinary, Searchable, and Linkable Resource, eds. Robert A. Scott and Stephen M. Kosslyn. Hoboken, NJ: John Wiley & Sons, pp. 1–16.


Seawright, Jason. 2016. Multi-Method Social Science: Combining Qualitative and Quantitative Tools. Cambridge, UK: Cambridge University Press. Scacco, Alexandra, and Shana S. Warren. 2018. “Can Social Contact Reduce Prejudice and Discrimination? Evidence from a Field Experiment in Nigeria.” American Political Science Review 112(3): 654–677. Sniderman, Paul M., and Douglas B. Grob. 1996. “Innovations in Experimental Design in Attitude Surveys.” Annual Review of Sociology 22(1): 377–399. Teele, Dawn Langan, ed. 2014. Field Experiments and Their Critics: Essays on the Uses and Abuses of Experimentation in the Social Sciences. New Haven, CT: Yale University Press. Tomz, Michael. 2007. “Domestic Audience Costs in International Relations: An Experimental Approach.” International Organization 61(4): 821–840. Van Bavel, Jay J., Peter Mende-Siedleckia, William J. Bradya, and Diego A. Reinero. 2016. “Contextual Sensitivity in Scientific Reproducibility.” Proceedings of the National Academy of Sciences of the United States of America 113(23): 6454–6459. Wantchekon, Leonard. 2003. “Clientelism and Voting Behavior: Evidence from a Field Experiment in Benin.” World Politics 55(3): 399–422. Zigerell, L. J. 2018. “Black and White Discrimination in the United States: Evidence from an Archive of Survey Experiment Studies.” Research and Politics 5(1): 1–7.

Part I

EXPERIMENTAL DESIGNS

CHAPTER 2

Conjoint Survey Experiments∗

Kirk Bansak, Jens Hainmueller, Daniel J. Hopkins, and Teppei Yamamoto

Abstract

Conjoint survey experiments have become a popular method for analyzing multidimensional preferences in political science. If properly implemented, conjoint experiments can obtain reliable measures of multidimensional preferences and estimate causal effects of multiple attributes on hypothetical choices or evaluations. This chapter provides an accessible overview of the methodology for designing, implementing, and analyzing conjoint survey experiments. Specifically, we begin by detailing a substantive example: How do candidate attributes affect the support of American respondents for candidates running against President Trump in 2020? We then discuss the theoretical underpinnings and advantages of conjoint designs. We next provide guidelines for practitioners in designing and analyzing conjoint survey experiments. We conclude by discussing further design considerations, common conjoint applications, common criticisms, and possible future directions.

* The authors express their gratitude to James N. Druckman, Donald P. Green, Alexander Coppock, and participants at the May 2019 “Northwestern Experimental Conference” for comments that significantly improved the manuscript. They also thank Emma Arsekin, David Azizi, Isaiah Gaines, and Sydney Loh for insightful research assistance.

2.1 Introduction

Political and social scientists are frequently interested in how people choose between

options that vary in multiple ways. For example, a voter who prefers candidates to be experienced and opposed to immigration may face a dilemma if an election pits a highly experienced immigration supporter against a less experienced immigration opponent. One might ask similar questions about a wide range of substantive domains – for instance, how people choose whether and whom to date, which job to take, and where to rent or buy a home. In all of these examples,


and in many more, people must choose among multiple options that are themselves collections of attributes. In making such choices, people must not only identify their preferences on each particular dimension, but also make trade-offs across the dimensions. Conjoint analysis is a survey-experimental technique that is widely used as a tool to answer these types of questions across the social sciences. The term originates in the study of “conjoint measurement” in 1960s mathematical psychology, when founding figures in the behavioral sciences such as Luce and Tukey (1964) developed axiomatic theories for decomposing “complex phenomena into sets of basic factors according to specifiable rules of combination” (Tversky 1967). Since the seminal publication of Green and Rao (1971), however, the term “conjoint analysis” has primarily been used to refer to a class of survey-experimental methods that estimates respondents’ preferences given their overall evaluations of alternative profiles that vary across multiple attributes, typically presented in tabular form. Traditional conjoint methods drew heavily on the statistical literature on the design of experiments (DOE) (e.g., Cox 1958), in which theories of complex factorial designs were developed for industrial and agricultural applications. However, conjoint designs became especially popular in marketing (see Raghavarao et al. 2011), as it was far easier to have prospective customers evaluate hypothetical products on paper than to build various prototypes of cars or hotels. Conjoint designs were also frequently employed in economics (Adamowicz et al. 1998) and sociology (Jasso and Rossi 1977; Wallander 2009), often under different names such as “stated choice methods” or “factorial surveys.” In the era before computerassisted survey administration, respondents would commonly have to evaluate dozens of hypothetical profiles printed on paper, and even then, analysis proceeded under strict assumptions about the permissible interactions among the attributes. Only in recent years, however, have conjoint survey experiments come to see

extensive use in political science (e.g., Abrajano et al. 2015; Bansak et al. 2016; Bechtel et al. 2019; Carnes and Lupu 2016; Franchino and Zucchini 2015; Hainmueller and Hopkins 2015; Horiuchi et al. 2018; Lowen et al. 2012; Mummolo and Nall 2016; Wright et al. 2016). This development has been driven partly by the proliferation of computer-administered surveys and by the concurrent ability to conduct fully randomized conjoint experiments at low cost. Reflecting the explosion of conjoint applications in academic political science publications, a conjoint analysis of Democratic voters’ preferences for presidential candidates even made an appearance on television via CBS News in the spring of 2019 (Khanna 2019). A distinctive feature of this strand of empirical literature is a new statistical approach to conjoint data based on the potential outcomes framework of causal inference (Hainmueller et al. 2014), which is in line with the explosion in experimental methods in political science generally since the early 2000s (Druckman et al. 2011). Along with this development, the past several years have also seen valuable advances in the statistical methods for analyzing conjoint data that similarly build on modern causal inference frameworks (Acharya et al. 2018; Dafoe et al. 2018; Egami and Imai 2019). In this chapter, we introduce conjoint survey experiments, summarize recent research employing them and improving their use, and discuss key issues that emerge when putting them to use. We do so partly through the presentation and discussion of an original conjoint application in which we examine an opt-in sample of Americans’ attitudes toward prospective 2020 Democratic presidential nominees.

2.2 An Empirical Example: Candidates Running against President Trump in 2020

To illustrate how one might implement and analyze a conjoint survey experiment, we conducted an original survey on an online,


Figure 2.1 An example conjoint table from the Democratic primary experiment. The full set of possible attribute values is provided in Table 2.1.

opt-in sample of 503 Amazon Mechanical Turk (MTurk) workers (for discussion of these types of samples, see the Krupnikov, Nam, and Style chapter in this volume). We designed our experiment to be illustrative of a typical conjoint design in political science. Specifically, we presented respondents with a series of tables showing profiles of hypothetical Democratic candidates running in the 2020 US presidential election. We asked: “This study is about voting and about your views on potential Democratic candidates for President in the upcoming 2020 general election . . . please indicate which of the candidates you would prefer to win the Democratic primary and hence run against President Trump in the general election” (emphasis in the original). We then presented a table that contained information about two political candidates side by side, described as “CANDIDATE A” and “CANDIDATE B,” which were purported to represent hypothetical Democratic

candidates for the 2020 election. Figure 2.1 shows an example table from the experiment. As is shown in Figure 2.1, conjoint survey experiments typically employ a tabular presentation of multiple pieces of information representing various attributes of hypothetical objects. This table is typically referred to as a “conjoint table” since it combines a multitude of varying attributes and presents them as a single object. In our experiment, we used a table containing two profiles of hypothetical Democratic candidates varying in terms of their age, gender, sexual orientation, race/ethnicity, previous occupation, military service, prior political experience, and positions on healthcare policy, immigration policy, and climate change policy. Table 2.1 shows the full set of possible levels for each of the attributes. We worked to choose a range of attributes that would be likely to be salient to voters during actual choices among primary candidates, but of course the conjoint


Table 2.1 The list of possible attribute values in the Democratic primary experiment.

Age: 37, 45, 53, 61, 77
Gender: Female, Male
Sexual Orientation: Straight, Gay
Race/Ethnicity: White, Hispanic/Latino, Black, Asian
Previous Occupation: Business executive, College professor, High school teacher, Lawyer, Doctor, Activist
Military Service Experience: Did not serve, Served in the Army, Served in the Navy, Served in the Marine Corps
Prior Political Experience: Small-city Mayor, Big-city Mayor, State Legislator, Governor, U.S. Senator, U.S. Representative, No prior political experience
Supports Government Healthcare for: All Americans; Only Americans who are older, poor, or disabled; Americans who choose it over private health plans
Supports Creating Pathway to Citizenship for: Unauthorized immigrants with no criminal record who entered the U.S. as minors; All unauthorized immigrants with no criminal record; No unauthorized immigrants
Position on Climate Change: Ban the use of fossil fuels after 2040, reducing economic growth by 5%; Impose a tax on using fossil fuels, reducing economic growth by 3%; Promote the use of renewable energy but allow continued use of fossil fuels

presentation will make select attributes more salient than their real-world counterparts while ignoring others. The levels presented in each table were then randomly varied, with randomization occurring independently across respondents, across tables, and across attributes. Each respondent was presented 15 such randomly generated comparison tables on separate screens, meaning that they evaluated a total of 30 hypothetical candidates (i.e., 15 choices between candidates). In order to preserve a smooth survey-taking experience, the order in which attributes were presented was held fixed across all 15 tables for each individual respondent, though the order was randomized across respondents. Put differently, every respondent saw the attributes in the same order for each of the 15 scenarios, but that order randomly varied across respondents. After presenting each of the conjoint tables with randomized attributes, we asked respondents two questions to measure their preferences about the hypothetical candidate profiles just presented. Specifically, we used a seven-point rating of the profiles (top of Figure 2.2) and a forced choice between the two profiles (bottom of Figure 2.2). We asked:

“On a scale from 1 to 7 . . . how would you rate each of the candidates described above?” and also: “Which candidate profile would you prefer for the Democratic candidate to run against President Trump in the general election?” The order of these two items was randomized (at the respondent level) so that we would be able to identify any order effects on outcome measurement if necessary. The substantive goal of our conjoint survey experiment was twofold and can be encapsulated by the following questions. First, what attributes causally increase or decrease the appeal of a Democratic primary candidate, on average, when varied independently of the other candidate attributes included in the design? As we discuss later in the chapter, the random assignment of attribute levels allows researchers to answer this question by estimating a causal effect called the average marginal component effect (AMCE) using simple statistical methods such as linear regression. Second, do the effects of the attribute vary depending on whether the respondent is a Democrat, Republican, or independent? For respondents who are Democrats, the conjoint task simulated the choice of their own presidential candidate to run against President Trump in


Figure 2.2 Outcome variables in the Democratic primary experiment.

the 2020 presidential election. So the main trade-off for them was whether to choose a candidate who was electable or a candidate who represented their own policy positions more genuinely. On the other hand, for Republican respondents, considerations were likely to be entirely different (at least for those who intended to vote for President Trump). As we show later, these questions can be answered by estimating conditional AMCEs (i.e., the average effects of the attributes conditional on a respondent characteristic measured in the survey, such as partisanship).
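To make the data-generating design just described concrete, here is a minimal, purely illustrative sketch; it is not the authors' survey code, all names are ours, and the attribute wording is abbreviated from Table 2.1. It generates one respondent's 15 randomly composed candidate pairs, drawing each attribute independently and, for simplicity, uniformly (as Section 2.4 explains, the actual study weighted some attributes). The attribute order is randomized once per respondent and then held fixed across tables, as in the experiment.

```python
import random

# Attribute levels, abbreviated from Table 2.1.
ATTRIBUTES = {
    "Age": [37, 45, 53, 61, 77],
    "Gender": ["Female", "Male"],
    "Sexual Orientation": ["Straight", "Gay"],
    "Race/Ethnicity": ["White", "Hispanic/Latino", "Black", "Asian"],
    "Previous Occupation": ["Business executive", "College professor",
                            "High school teacher", "Lawyer", "Doctor", "Activist"],
    "Military Service": ["Did not serve", "Army", "Navy", "Marine Corps"],
    "Prior Political Experience": ["Small-city Mayor", "Big-city Mayor", "State Legislator",
                                   "Governor", "U.S. Senator", "U.S. Representative", "None"],
    "Healthcare": ["All Americans", "Older, poor, or disabled only", "Public option"],
    "Citizenship Pathway": ["Minors without criminal record", "All without criminal record",
                            "None"],
    "Climate": ["Fossil fuel ban after 2040", "Fossil fuel tax", "Promote renewables"],
}

def make_respondent_tasks(n_tasks=15, seed=None):
    """Generate one respondent's conjoint tasks: each task is a pair of profiles whose
    attribute values are drawn uniformly and independently; the attribute order is
    randomized once per respondent and then held fixed across all tasks."""
    rng = random.Random(seed)
    attribute_order = list(ATTRIBUTES)
    rng.shuffle(attribute_order)  # respondent-level randomization of attribute order
    tasks = []
    for _ in range(n_tasks):
        pair = [{attr: rng.choice(ATTRIBUTES[attr]) for attr in attribute_order}
                for _candidate in ("A", "B")]
        tasks.append(pair)
    return attribute_order, tasks

order, tasks = make_respondent_tasks(seed=2020)
print(order)        # this respondent's fixed attribute ordering
print(tasks[0][0])  # "Candidate A" in the first of the 15 tables
```

In a fielded study, the survey platform would perform this randomization and record the realized levels alongside the rating and forced-choice responses for analysis; the seed here is only to make the illustration reproducible.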

2.3 Advantages of Conjoint Designs over Traditional Survey Experiments

Our Democratic primary experiment represents a typical example of the conjoint survey experiments widely implemented across the empirical subfields of political science. A few factors have driven the upsurge in the use of conjoint survey experiments. First, there has been increased attention to causal inference and to experimental designs that allow for inferences about causal effects via assumptions made credible by the experimental design itself (Sniderman and Grob 1996). At the same time, however, researchers are often interested in testing hypotheses that go beyond the simple cause-and-effect relationship between a single

binary treatment and an outcome variable. Traditional survey experiments are typically limited to analyzing the average effects of a few randomly assigned treatments, constraining the range of substantive questions researchers can answer persuasively. In contrast, conjoint experiments allow researchers to estimate the effects of many attributes simultaneously, and so can permit analysis of more complex causal questions. A second enabling factor is the rapid expansion of surveys administered via computer, which enables researchers to use fully randomized conjoint designs (Hainmueller et al. 2014). Fully randomized designs, in turn, facilitate the estimation of key quantities such as AMCEs via straightforward statistical estimation procedures that rely little on modeling assumptions. Moreover, commonly used web-based survey interfaces facilitate the implementation of complex survey designs such as conjoint experiments. A third underlying factor behind the rise of conjoint designs within political science is their close substantive fit with key political science questions. For example, political scientists have long been interested in how voters choose among candidates or parties, a question for which conjoint designs are well suited. By quantifying the causal effects of various candidate attributes presented simultaneously, conjoint designs enable researchers to explore a wide range of


hypotheses about voters’ preferences, relative sensitivities to different attributes, and biases. But beyond voting, multidimensional choices and preferences are of interest to political scientists in many contexts and issue areas, such as immigration, neighborhoods and housing, and regulatory policy packages. As we discuss later in this chapter, conjoint designs have been applied in each of these domains and beyond. Fourth, political scientists are often interested in measuring attitudes and preferences that might be subject to social desirability bias. Scholars have argued that conjoint designs can be used as effective measurement tools for socially sensitive attitudes, such as biases against female political candidates (Teele et al. 2018) and opposition to siting a low-income housing project in one’s neighborhood (Hankinson 2018). When respondents evaluate several attributes simultaneously, they may be less concerned that researchers will connect their choices to one specific attribute. In keeping with this expectation, early evidence suggests that fully randomized conjoint designs do indeed mitigate social desirability bias by asking about a socially sensitive attribute along with a host of other randomly varying attributes (Horiuchi et al. 2019). Finally, evidence suggests that conjoint designs have desirable properties in terms of validity. On the dimension of external validity, Hainmueller et al. (2015) find that certain conjoint designs can effectively approximate real-world benchmarks in Swiss citizenship votes, while Auerbach and Thachil (2018) find that political brokers in Indian slums have the attributes that local residents reported valuing via a conjoint experiment. Conjoint designs have also proven to be quite robust. For one thing, online, optin respondents commonly employed in social science research can complete many conjoint tasks before satisficing demonstrably degrades response quality (Bansak et al. 2018). Such respondents also prove able to provide meaningful and consistent responses even in the presence of a large number of attributes (Bansak et al. 2019; see also Jenke et al. 2020).

In short, conjoint designs have a range of theoretical and applied properties that make them attractive to political scientists. But, of course, no method is appropriate for all applications. Later in this chapter, we therefore flag the limitations of conjoint designs as well as the open questions about their usage and implementation.

2.4 Designing Conjoint Survey Experiments

When implementing a conjoint experiment, survey experimentalists who are new to conjoint analysis face a multitude of design considerations. Here, we review a number of key components of a conjoint design that have implications for conjoint measurement and offer guidance on how to approach them, using the Democratic primary experiment as a running example.

2.4.1 Number of Profiles

In the Democratic primary experiment, we used a “paired-profile” design in which each conjoint table contained two profiles of hypothetical Democratic candidates. But other designs are also possible. One example is a “single-profile” design in which each table presents only one set of attribute values; another is a multiple-profile design that contains more than two profiles per table. Empirically, paired-profile designs appear to be the most popular choice among political scientists, followed by single-profile designs. Hainmueller et al. (2015) provide empirical justification for this choice, showing that paired-profile designs tend to perform well compared to single-profile designs, at least in the context of their study of Swiss voting on naturalization. In other contexts, a single profile or more than two profiles may be most appropriate, subject of course to limitations in respondents’ ability to compare many profiles simultaneously.

2.4.2 Number of Attributes

An important practical question is how many attributes to include in a conjoint experiment.


Here, researchers face a difficult trade-off between masking and satisficing (Bansak et al. 2019). On the one hand, including too few attributes will make it difficult to interpret the substantive meaning of AMCEs, since respondents might associate an attribute with another that is omitted from the design. Such a perceived association between an attribute included in the design and another omitted attribute muddies the interpretation of the AMCE of the former as it may represent the effects of both attributes (i.e., masking; for more, see Bansak et al. 2019; Dafoe et al. 2018). In our Democratic primary experiment, for example, the AMCEs of the policy position attributes might mask the effect of other policy positions that are not included in the design if respondents associate a liberal position on the former with a similarly liberal position on the latter. On the other hand, including too many attributes might increase the cognitive burden of the tasks excessively, inducing respondents to satisfice (Krosnick 1999). Given the inherent trade-off, how many attributes should one use in a conjoint experiment? Although the answer to the question is likely to be highly context dependent, Bansak et al. (2019) provide useful evidence that subjects recruited from popular online survey platforms such as MTurk are reasonably resistant to satisficing due to the increase in the number of conjoint attributes. Based on the evidence, they conclude that the upper bound on the permissible number of conjoint attributes for online surveys is likely to be above those used in typical conjoint experiments in political science, such as our Democratic primary example in which 10 attributes were used. Jenke et al. (2020) also explore the robustness of conjoint experiments to the addition of attributes by using eye-tracking methods to examine how respondents process information in conjoint surveys, administered to university students and local community members. They find that respondents are able to adapt to the increased complexity of additional attributes and to reduce cognitive processing costs by selectively incorporating relevant new information into their choices (and ignoring


less relevant information). Of course, how many attributes might be too many also likely depends on the sample of respondents and the mode of delivery.

2.4.3 Randomization of Attribute Levels

Regardless of the number of profiles per table, conjoint designs entail a random assignment of attribute values. The canonical, fully randomized conjoint experiment randomly draws a value for each attribute in each table from a prespecified set of possible values (Hainmueller et al. 2014). This makes the fully randomized conjoint experiment a particular type of factorial experiment, on which an extensive literature exists in the field of DOE. In our experiment, for example, we chose the set of possible values for the age attribute to be [37, 45, 53, 61, 77], and we randomly picked one of these values for each profile with equal probability (= 1/5). As discussed later, the random assignment of attribute values enables inference about the causal effects of the attributes without reliance on untestable assumptions about the form of respondents’ utility functions or the absence of interaction effects (Hainmueller et al. 2014).1 In most existing applications of conjoint designs in political science, attributes are randomized uniformly (i.e., with equal probabilities for all levels in a given attribute) and independently from one another. Although uniform independent designs are attractive because of parsimony and ease of implementation, the conjoint design can accommodate other kinds of randomization distributions. Often, researchers have good reasons to deviate from the standard uniform independent design for the sake of realism and external validity (Hainmueller et al.

1 In marketing science, researchers often use conjoint designs that do not employ randomization of attributes. This alternative approach relies on the theory of orthogonal arrays and fractional factorial designs derived from the classical DOE literature, as opposed to the potential outcomes framework for causal inference (Hainmueller et al. 2014). The discussion of this traditional approach is beyond the scope of this chapter, although there exist a small number of applications of this approach in political science (e.g., Franchino and Zucchini 2015).


2014). In designing our experiment, for example, we wanted to ensure that the marginal distributions of the candidate attributes were roughly representative of the attributes of the politicians who were considered to be likely candidates in the actual Democratic primary election at that time. Thus, in addition to choosing attribute values that matched those of the actual likely candidates, we employed a weighted randomization such that some values would be drawn more frequently than others. Specifically, we made our hypothetical candidates more likely: to be straight than gay (with 4:1 odds); to be White than Black, Latino/Hispanic, or Asian (6:2:2:1); and to have never served in the military than to have served in the Army, Navy, or Marine Corps (4:1:1:1). Weighted randomization causes no fundamental threat to the validity of causal inference in conjoint analysis, although it introduces some important nuances in the estimation and interpretation of the results. We will come back to these issues in the next section. Another possible “tweak” to the randomization distribution is to introduce dependence between some attributes (Hainmueller et al. 2014). The most common instance of this is restricted randomization, or prohibiting certain combinations of attribute values from happening. Restricted randomization is typically employed to ensure that respondents will not encounter completely unrealistic (or sometimes even logically impossible) profiles. For example, in the “immigration conjoint” study reported in Hainmueller et al. (2014), the authors impose the restriction that immigrants with high-skilled occupations must at least have a college degree. In our current Democratic primary experiment, we chose not to impose any such “hard” constraints on the randomization distribution because we chose attribute values that were all reasonably plausible to co-occur in an actual profile of a Democratic candidate. Like weighted randomization, restricted randomization does not pose a fundamental problem for making valid causal inferences from conjoint experiments, unless it is taken to

the extreme. However, restricted designs require care in terms of estimation and interpretation, especially when it is not clear what combinations of attributes make a profile unacceptably unrealistic. More discussion is found later in this chapter. 2.4.4 Randomization of Attribute Ordering In addition to randomizing the values of attributes, it is often recommended to randomize the order of the attributes in a conjoint table, so that the causal effects of attributes themselves can be separately identified from pure order effects (e.g., the effects of an attribute being placed near the top of the table vs. towards the bottom). In many applications, attribute ordering is better randomized at the respondent level (i.e., for a given respondent, randomly order attributes in the first table and fix the order throughout the rest of the experiment), and that is precisely what we did in the experiment presented here. This is because reshuffling the order of attributes from one table to another is likely to cause excessive cognitive burden for respondents (Hainmueller et al. 2014). 2.4.5 Outcome Measures After presenting a conjoint table with randomized attributes, researchers then typically ask respondents to express their preferences with respect to the profiles presented. These preferences can be measured in various ways, and those measurements then constitute the outcome variable of interest in the analysis of conjoint survey data. The individual rating and forced choice outcomes are the two most common measures of stated preference in political science applications of conjoint designs, and there are distinct advantages to each. On the one hand, presenting a forced choice may compel respondents to think more carefully about trade-offs. On the other hand, individual ratings (or non-forced choices where respondents can accept/reject all profiles presented) allow respondents to express approval or disapproval of each profile without constraints, which also allows for

Conjoint Survey Experiments

the identification of respondents that categorically accept/reject all profiles.2 It is important to note that whether respondents are forced to choose among conjoint profiles or are able to rate them individually can influence one’s conclusions, so it is often valuable to elicit preferences about profiles in multiple ways. Indeed, researchers commonly ask respondents to both rank profiles within a group and to rate each profile individually. 2.4.6 Number of Tasks In typical conjoint survey experiments in political science, the task (i.e., a randomly generated table of profiles followed by outcome measurements) is repeated multiple times for each respondent, each time drawing a new set of attribute values from the same randomization distribution. In our Democratic primary experiment, respondents were given 15 paired comparison tasks, which means they evaluated a total of 30 hypothetical candidate profiles. One important advantage of conjoint designs is that one can obtain many more observations from a given number of respondents without compromising validity than a traditional survey experiment, where within-subject designs are often infeasible due to validity concerns. This, together with the fact that one can also test the effects of a large number of attributes (or, equivalently, treatments) at once, makes the conjoint design a highly costefficient empirical strategy. One concern, however, is the possibility of respondent fatigue when the number of tasks exceeds respondents’ cognitive capacity. The question then is: How many tasks are too many? The answer is likely highly dependent on the nature of the conjoint task (e.g., how complicated the profiles are) and of the respondents (e.g., how familiar they are with the subject matter at hand), so it is wise to err on the conservative side. 2 One open question for future research is just how many outcome variables researchers can ask about different aspects of each conjoint task, but in our experience, it can be quite valuable to have respondents assess the profiles in multiple ways.

27

However, Bansak et al. (2018) empirically show that inferences from conjoint designs are robust with respect to the number of tasks for samples recruited from commonly used online opt-in panels. In particular, their findings indicate that it is safe to use as many as 30 tasks with respondents from MTurk and Survey Sampling International’s online panel without detectable degradation in response quality. Although one should be highly cautious in extrapolating their findings to other samples, it appears to reinforce the use of 15 tasks in our Democratic primary experiment, which draws on MTurk respondents. To be sure, researchers need to consider the overall survey length and its effects on attrition and respondent engagement as well. 2.4.7 Variants of Conjoint Designs Finally, a survey experimental design that is closely related to the conjoint experiment is the so-called vignette experiment. Like a conjoint experiment, a vignette experiment typically describes a hypothetical object that varies in terms of multiple attributes and asks respondents to either rate or choose their preferred profiles. The key difference is that a profile is presented as a descriptive text as opposed to a table. For example, a vignette version of our Democratic primary experiment would use a paragraph like the following to describe the profile of a candidate: “CANDIDATE A is a 37-yearold straight Black man with no past military service or political experience. He used to be a college professor. He supports providing government healthcare for all Americans, creating a pathway to citizenship for unauthorized immigrants with no criminal record, and a complete fossil fuel ban after 2040 even with a substantial reduction in economic growth.” The vignette design can simply be viewed as a type of a conjoint experiment, since it shares most of the key design elements with table-based conjoint experiments that we have assumed in our discussion so far. However, there are a few important reasons to prefer the tabular presentation

28

Kirk Bansak, Jens Hainmueller, Daniel J. Hopkins, and Teppei Yamamoto

of attributes in many cases. First, it can be more difficult to randomize the order of attributes in a vignette experiment, since certain changes might cause the text to become incoherent due to grammatical and sentence structure issues. Second, Hainmueller et al. (2015) show empirically that, at least in their validation study, vignette designs tend to perform less well than tabular conjoint designs, and they also find evidence suggesting that the performance advantage for tabular conjoint designs is due to increased engagement with the survey. Specifically, they find that the effects estimated from a vignette design are consistently attenuated towards zero (while maintaining the directions) compared to the estimates from an otherwise identical tabular conjoint experiment. Vignettes may also heighten respondent fatigue and so reduce the number of tasks respondents are able to complete without excessive satisficing. That being said, certain research questions might naturally call for a vignette format, and the analytical framework discussed below is directly applicable to fully randomized vignette designs as well.

2.5 Analyzing Data from Conjoint Survey Experiments In this section, we provide an overview of the common statistical framework for the causal analysis of conjoint survey data. Much of the theoretical underpinning for the methodology comes directly from the literature on potential outcomes and randomized experiments (e.g., Imbens and Rubin 2015). We refer readers to Hainmueller et al. (2014) for a more formal treatment of the materials here. A key quantity in the analysis of conjoint experiments is the AMCE, a causal estimand first defined by Hainmueller et al. (2014) as a quantity of interest. Our discussion below thus focuses on what the AMCE is, how it can be estimated, and how to interpret it. Given the interest among political scientists in using conjoint experiments to study elections, it is worth highlighting that the AMCE, when applied to elections, is directly interpretable

as a causal effect on a candidate’s or party’s expected vote share. For a more detailed discussion of the AMCE in this context, see Bansak et al. (2020).

2.5.1 Motivation, Definition, and Estimation

As we discussed in the previous section, the fully randomized conjoint design is a particular instance of a full factorial design, where each of the attributes can be thought of as a multi-valued factor (or a “treatment component” in our terminology). This enables us to analyze conjoint survey data as data arising from a survey experiment with multiple randomized categorical treatments, to which we can apply a standard statistical framework for causal inference such as the potential outcomes framework.3 From this perspective, the analysis of conjoint survey data is potentially straightforward, for the average treatment effect (ATE) of any particular combination of the treatment values against another can be unbiasedly estimated by simply calculating the difference in the means of the observed outcomes between the two groups of responses that were actually assigned those treatment values in the data. For example, in our Democratic primary experiment, we might consider estimating the ATE of a 61-year-old straight White female former business executive with no prior military service or experience in elected office who supports government-provided healthcare for all Americans, creating a pathway to citizenship for all unauthorized immigrants with no criminal record, and imposing a tax on fossil fuels, versus a 37-year-old gay Latino male former lawyer turned state legislator with no military service who supports the same positions on healthcare, unauthorized immigrants, and climate change.

3 Alternatively, one can also apply more traditional analytical tools for factorial designs developed in the classical DOE literature. As discussed above, this is the more common approach in marketing science. On the other hand, the causal inference approach described in the rest of this section has been by far the most dominant methodology in recent applications of conjoint designs in political science.
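Stated a bit more formally, in notation of our own that does not appear in the chapter: let $Y_i(\mathbf{t})$ denote the potential response for evaluated profile $i$ under attribute combination $\mathbf{t}$. The profile-level ATE just described, and its simple difference-in-means estimator, can then be sketched as

\[
\tau(\mathbf{t}, \mathbf{t}') = \mathbb{E}\big[\, Y_i(\mathbf{t}) - Y_i(\mathbf{t}')\,\big],
\qquad
\hat{\tau}(\mathbf{t}, \mathbf{t}') = \frac{1}{N_{\mathbf{t}}} \sum_{i:\, \mathbf{T}_i = \mathbf{t}} Y_i \;-\; \frac{1}{N_{\mathbf{t}'}} \sum_{i:\, \mathbf{T}_i = \mathbf{t}'} Y_i,
\]

where $\mathbf{T}_i$ is the combination actually assigned to profile $i$ and $N_{\mathbf{t}}$ and $N_{\mathbf{t}'}$ count the profiles assigned each combination. The practical difficulty discussed next is that, with so many possible combinations, these counts are tiny or zero for any particular pair.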


Thinking through this example immediately makes it apparent that this approach has several problems. First, substantively, researchers rarely have a theoretical hypothesis that concerns a contrast between particular pairs of attribute value combinations when their conjoint table includes as many attributes as in our experiment. Instead, researchers employing a conjoint design are typically primarily interested in estimating the effects of individual attributes, such as the effect of gender, while allowing respondents also to consider explicitly other attributes that might affect their evaluations of the hypothetical candidates. In other words, a typical quantity of interest in conjoint survey experiments is the overall effect of a given attribute averaged across other attributes that also appear in the conjoint table. Second, statistically, estimating the effect of a particular combination of attribute values against another based on a simple difference in means requires an enormous sample size, since the number of possible combinations of attribute values is very large compared to the number of actual observations. In our experiment, there were 5 × 2 × 2 × 4 × 6 × 4 × 7 × 3 × 3 × 3 = 362,880 possible unique profiles, whereas our observed data contained only 30 × 503 = 15,090 sampled profiles. This implies that observed data from a fully randomized conjoint experiment are usually far too sparse to estimate the ATEs of particular attribute combinations for the full set of attributes included in the study. For these reasons, researchers instead focus on an alternative causal quantity called the AMCE in most applications of conjoint survey experiments in political science. The AMCE represents the effect of a particular attribute value of interest against another value of the same attribute while holding equal the joint distribution of the other attributes in the design, averaged over this distribution as well as the sampling distribution from the population. This means that an AMCE can be interpreted as a summary measure of the overall effect of an attribute after taking into account the possible effects of the other attributes by averaging over the effect variations caused by them.
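Using the same style of notation (again ours, and only a sketch loosely following the formal definition in Hainmueller et al. 2014), write a profile as $(T_{il}, \mathbf{T}_{i[-l]})$, separating the attribute of interest $l$ from the remaining attributes. The AMCE of level $t_1$ versus $t_0$ of attribute $l$ can then be written as

\[
\pi_l(t_1, t_0) = \mathbb{E}\big[\, Y_i\big(t_1, \mathbf{T}_{i[-l]}\big) - Y_i\big(t_0, \mathbf{T}_{i[-l]}\big) \,\big],
\]

where the expectation averages over the randomization distribution of the other attributes and over the population of respondents. Under independent randomization, this quantity can be estimated by the simple averages described next or, equivalently, by a linear regression of the outcome on indicators for the attribute levels (the approach used for Figure 2.3). The snippet below is a hedged sketch of that computation; the data file, column names, and variable names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format conjoint data: one row per evaluated profile, with a
# respondent identifier, the randomized attribute levels, and the forced-choice outcome.
df = pd.read_csv("conjoint_long.csv")  # assumed columns: respondent_id, gender, chosen, ...

# (1) AMCE of Female vs. Male on the forced choice, by difference in means.
amce_diff = (df.loc[df["gender"] == "Female", "chosen"].mean()
             - df.loc[df["gender"] == "Male", "chosen"].mean())
print(f"AMCE (Female vs. Male), difference in means: {amce_diff:.3f}")

# (2) Equivalent linear regression, with standard errors clustered by respondent
#     because each respondent contributes many evaluated profiles.
model = smf.ols('chosen ~ C(gender, Treatment(reference="Male"))', data=df)
fit = model.fit(cov_type="cluster", cov_kwds={"groups": df["respondent_id"]})
print(fit.summary())

# Conditional AMCEs (e.g., among Democratic respondents) can be sketched by subsetting
# the data before fitting, for instance df[df["party_id"] == "Democrat"].
```

In applied work one would typically regress the outcome on indicator sets for all attributes at once, which, under fully independent randomization, targets the same AMCEs; that is the kind of regression described for the estimates in Figure 2.3.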


For example, suppose that one is interested in the overall effect on the rating outcome measure of a candidate being female as opposed to male in our Democratic primary experiment. That is, what is the average causal effect of being a female candidate as opposed to a male candidate on the respondent’s candidate rating when they are also given information about the candidate’s age, race/ethnicity, etc.? To answer this question, one can estimate the AMCE of female versus male by simply calculating the average rating of all realized female candidate profiles, calculating the average rating of all male profiles, and taking the difference between the two averages.4 Put differently, the AMCE is an average over the distribution of other attributes. The same procedure could also be performed with respect to the forced choice outcome measure to assess the average causal effect of being a female candidate as opposed to a male candidate on the probability that a candidate will be chosen. In that case, one can estimate the AMCE of female versus male by calculating the proportion of all realized female candidate profiles that were chosen, calculating the proportion of all male profiles that were chosen, and taking the difference between the two. The fact that the AMCE summarizes the overall average effect of an attribute when respondents are also given information on other attributes is appealing substantively because in reality respondents would often have such information on other attributes when making a multidimensional choice.

4 The validity of this estimation procedure requires that the gender attribute be randomized independently of any other attributes. If the randomization distribution did include dependency between gender and other attributes (e.g., female candidates were made more likely to have prior political experience than male candidates), then the imbalance in those attributes between male and female candidates would have to be taken into account explicitly when estimating the AMCE. See Hainmueller et al. (2014) for more details.
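A minimal sketch of the difference-in-means estimation just described, assuming a hypothetical profile-level data set with placeholder file and column names (one row per candidate profile shown, with columns gender, rating, and chosen); as noted in the footnote, this relies on gender being randomized independently of the other attributes.

```python
import pandas as pd

# Hypothetical profile-level data: one row per candidate profile shown to a respondent,
# with 'gender' ('Female'/'Male'), 'rating' (seven-point scale), and 'chosen'
# (1 if the profile was picked in the forced choice task, 0 otherwise).
profiles = pd.read_csv("conjoint_profiles.csv")  # placeholder file name

female = profiles[profiles["gender"] == "Female"]
male = profiles[profiles["gender"] == "Male"]

# AMCE of Female vs. Male on the rating outcome: difference in mean ratings.
amce_rating = female["rating"].mean() - male["rating"].mean()

# AMCE of Female vs. Male on the forced choice outcome: difference in the
# proportions of profiles that were chosen.
amce_choice = female["chosen"].mean() - male["chosen"].mean()

print(amce_rating, amce_choice)
```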


[Figure 2.3 about here: estimated AMCEs with 95% confidence intervals for the levels of all 10 candidate attributes (age, gender, sexual orientation, race, previous occupation, military service, political experience, healthcare position, immigration position, and climate position); horizontal axis is the effect on probability of support, ranging from −0.1 to 0.1.]

Figure 2.3 Average marginal component effects of candidate attributes in the Democratic primary conjoint experiment (forced choice outcome). DACA = Deferred Action for Childhood Arrivals.

2.5.2 Interpretation

Figure 2.3 shows estimated AMCEs for each of the 10 attributes included in our Democratic primary experiment along with their 95% confidence intervals, using the forced choice item as the outcome measure. These AMCEs were estimated via linear regression. Interpreting AMCEs is intuitive. For example, for our opt-in sample of 503 American respondents recruited through MTurk, presenting a hypothetical candidate as straight as opposed to gay increased the probability of respondents choosing the profile as their preferred candidate by about 4 percentage points on average when respondents were also given information about the other nine attributes. Thus, the AMCE represents a causal effect of an

attribute value against another averaged over possible interaction effects with the other included attributes, as well as over possible heterogeneous effects across respondents.

Despite its simplicity, there are important nuances to keep in mind when interpreting AMCEs that are often neglected in applied research. First, the AMCE of an attribute value is always defined with respect to a particular reference value of the same attribute, or the “baseline” value of the attribute. This parallels any other regression model or a standard survey experiment in which a treatment effect always represents the effect of the treatment against the particular control


condition used in the experiment. Researchers sometimes neglect this feature when analyzing conjoint experiments, as Leeper et al. (2020) point out.

Second, an important feature of the AMCE as a causal parameter is that it is always defined with respect to the distribution used for the random assignment of the attributes. That is, the true value of the AMCE, as well as its substantive meaning, also changes when one changes the randomization distribution, unless the effect of the attribute has no interaction with other attributes. For example, as mentioned earlier, we used nonuniform randomization distributions for assigning some of the candidate attributes in our Democratic primary experiment, such as candidates’ sexual orientation. Had we used a uniform randomization for the sexual orientation attribute (i.e., 1/2 straight and 1/2 gay) instead, the AMCE of another attribute (e.g., gender) could have been either larger or smaller than what is reported in Figure 2.3, depending on how the effect of that attribute interacts with that of sexual orientation. This important nuance should always be kept in mind when interpreting AMCEs. Hainmueller et al. (2014) discuss this point in more detail (see also de la Cuesta et al. 2019).

As a related point, the AMCE differs in its handling of ties from certain other estimation strategies (such as conditional logistic regression) that have sometimes been applied to analyze forced choices (see also Ganter 2019). Consider an example in which a respondent is required to choose between two profiles with two attributes each, including one attribute of interest that has two levels, each appearing with 0.5 probability. Let’s further assume that respondents care only about the attribute of interest and that they flip a coin whenever the two profiles are identical on that attribute. In 50% of all pairings, the levels of this attribute of interest will indeed be equal in the two profiles. As a consequence, even a respondent who always prefers one level of that attribute of interest will commonly see pairings in which both profiles have that attribute level, and so will be forced to choose against profiles that include their preferred attribute level.


Conversely, they will also see some profile pairings in which their preferred attribute level is entirely absent, and so will have to choose the non-preferred attribute level. The upshot is that because of the presence of tied attribute levels, the AMCEs of interest in this example will be −0.5 and 0.5. More generally, in the presence of ties, the AMCEs for forced choice outcomes will be bounded at strictly less than 1 or greater than −1, with the extent of the deviation from −1 and 1 increasing with the probability of ties. The intuition is straightforward: even someone who always prefers female candidates will have to choose a male candidate in any pairing that pits two male profiles against each other. This issue does not emerge when respondents are rating individual profiles. Because ties occur frequently in many real-world settings, retaining ties will typically increase the realism of the conjoint. In some instances, researchers may wish to use constrained randomization to avoid ties altogether or to produce ties deliberately, such as when studying social desirability (e.g., Horiuchi et al. 2019). But when doing so, it is critical to acknowledge that such constrained randomizations change the distribution of attribute levels over which the other AMCEs are defined.

Finally, it is worth reiterating that the AMCE represents an average of individual-level causal effects of an attribute. In other words, for some respondents the attribute might have a large effect and for others the effect might be small or zero, and the AMCE represents the average of these potentially heterogeneous effects. This is no different from most of the commonly used causal estimands in any other experiment, such as the ATE or local ATE. Researchers often care about average causal effects because they provide an important and concise summary of what would happen on average to the outcome if everybody moved from control to treatment (Holland 1986). The fact that the ATE and AMCE average over both the sign and the magnitude of the individual-level causal effects is an important feature of these estimands, because both sign and magnitude are important in capturing the response to a treatment.



As a case in point, one of the only real-world empirical validations of conjoint experiments of which we are aware finds evidence that AMCEs from a survey experiment do recover the corresponding descriptive parameters in Swiss citizenship elections (Hainmueller et al. 2015; see also Auerbach and Thachil 2018). Moreover, when applied to candidate or party choice, the AMCE can be interpreted as the increase in the average vote share attributed to the presence of a specific attribute level (Bansak et al. 2020). In sum, the AMCE is a highly useful and informative summary of aggregate preferences in multidimensional settings.

That said, an average causal effect does not necessarily tell the whole story. Just as an ATE can hide important heterogeneity in the individual-level causal effects, the AMCE might also hide such heterogeneity – for example, if the effect of an attribute value is negative for one half of the sample and positive for the other half. In such settings, conditional AMCEs for relevant subgroups might be useful to explore, as we discuss later in this section. Similarly, just as a positive ATE does not necessarily imply that a treatment has positive individual-level effects for a majority of subjects, a positive AMCE does not imply that a majority of respondents prefer the attribute value in question (Abramson et al. 2019). Researchers should accordingly be careful in their choice of language when describing the substantive interpretation of AMCEs as average causal effects.
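The tie-induced bound described above can be checked with a small simulation. The sketch below is a hypothetical illustration (not drawn from the chapter’s data): it generates two-profile tasks in which the attribute of interest takes each of its two levels with probability 0.5, assumes a respondent who always picks the profile with the preferred level and flips a coin when the profiles are tied on it, and then computes the difference-in-means AMCE estimate, which converges to 0.5 rather than 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tasks = 200_000

# Attribute of interest for the two profiles in each task: 1 = preferred level, 0 = other,
# each drawn independently with probability 0.5.
a = rng.integers(0, 2, size=(n_tasks, 2))

# Choice rule: pick the profile with the preferred level when the profiles differ on the
# attribute; flip a coin when they are tied.
coin = rng.integers(0, 2, size=n_tasks)
choice = np.where(a[:, 0] != a[:, 1], a.argmax(axis=1), coin)

# Stack the data profile by profile: outcome = 1 for the chosen profile, 0 for the other.
levels = np.concatenate([a[np.arange(n_tasks), choice], a[np.arange(n_tasks), 1 - choice]])
outcome = np.concatenate([np.ones(n_tasks), np.zeros(n_tasks)])

amce_hat = outcome[levels == 1].mean() - outcome[levels == 0].mean()
print(round(amce_hat, 3))  # approximately 0.5, not 1, because of ties
```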

2.5.3 More on Estimation and Inference

Despite the high dimensionality of the design matrix for our factorial conjoint treatments, the AMCEs in our Democratic primary experiment are reasonably precisely estimated based on 503 respondents, as can be seen from the widths of the confidence intervals in Figure 2.3. Many applied survey experimentalists find this rather surprising, since it seems to run counter to the conventional wisdom of being conservative

in adding treatments to factorial experiments. What is the “trick” behind this? The answer to this question lies in the implicit averaging of the profile-specific treatment effects in the definition of the AMCE. Once we focus on a particular attribute of interest, the remaining attributes become covariates (that also happen to be randomly assigned) for the purpose of estimating the particular AMCE. This implies that those attributes simply add to the infinite list of pretreatment covariates that might also vary across respondents or tasks, which are also implicitly averaged over when calculating the observed difference in means. Thus, a valid inference can be made for the AMCE by simply treating the attribute of interest as if it were the sole categorical treatment in the experiment, although statistical efficiency might be improved by explicitly incorporating the other attributes in the analysis.

A straightforward method to incorporate information about all of the attributes in estimating the individual AMCEs for the sake of efficiency is to run a linear regression of the observed outcome on the entire set of attributes, each being “dummied out” with the baseline value set as the omitted category. The estimates presented in Figure 2.3 are based on this methodology instead of individual differences in means. The multiple regression approach has the added benefit of convenience in that one can estimate the AMCEs for all attributes at once, and despite the superficial use of a linear regression model, it requires no functional form assumption by virtue of full randomization.5 Thus, this approach is currently the most popular in applied studies. These estimation methods can be applied to various types of outcome variables – such as binary choices, rankings, and ratings – without modification.

5 The regression model must be modified to contain appropriate interaction terms if the randomization distribution includes dependence across attributes. The estimated regression coefficients must then be averaged over with appropriate weights to obtain an unbiased estimate of the AMCEs affected by the dependence. Details are provided by Hainmueller et al. (2014).
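A minimal sketch of the regression approach just described, again using the hypothetical profile-level data set with placeholder attribute and level names rather than the study’s actual labels. Each attribute enters as a categorical variable with its baseline level as the omitted category, so each coefficient estimates the AMCE of a level against that baseline; clustering the standard errors on respondents, shown here, is one common way to account for the repeated tasks per respondent.

```python
import pandas as pd
import statsmodels.formula.api as smf

profiles = pd.read_csv("conjoint_profiles.csv")  # hypothetical data, as above

# Regress the forced-choice outcome on all attributes at once, each "dummied out"
# with its baseline level as the omitted category (only a few attributes are written
# out here; the remaining ones would be added analogously).
formula = (
    "chosen ~ C(gender, Treatment(reference='Male'))"
    " + C(orientation, Treatment(reference='Gay'))"
    " + C(race, Treatment(reference='White'))"
    " + C(political_experience, Treatment(reference='No prior political experience'))"
)

fit = smf.ols(formula, data=profiles).fit(
    cov_type="cluster", cov_kwds={"groups": profiles["respondent_id"]}
)
print(fit.params)  # each coefficient is an estimated AMCE of a level versus its baseline
```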


[Figure 2.4 about here: estimated AMCEs with 95% confidence intervals for the levels of all 10 candidate attributes; horizontal axis is the effect on rating, ranging from −0.50 to 0.50.]

Figure 2.4 Average marginal component effects of candidate attributes in the Democratic primary conjoint experiment (rating outcome).

For illustration, Figure 2.4 shows estimated AMCEs for each of the 10 attributes included in our Democratic primary experiment, along with their 95% confidence intervals, using the seven-point scale rating instead of the forced choice item as the outcome measure. In this application, the estimated AMCEs from the rating outcome are similar to those from the forced choice outcome. Such a similar pattern between the two types of outcomes is frequently, but not always, observed in our experience with conjoint experiments.

2.5.4 Conditional AMCE

Another common quantity of interest in conjoint applications in political science

is the conditional AMCE, or the AMCE for a particular subgroup of respondents defined based on a pretreatment respondent characteristic (Hainmueller et al. 2014; also see Chapter 13 in this volume). In our Democratic primary experiment, a natural question of substantive interest is whether preferences about hypothetical Democratic nominees might differ depending on respondents’ partisanship. To answer this question, we analyze the conditional AMCEs of the attributes by estimating the effects for different respondent subgroups based on their partisanship. Figure 2.5 shows the estimated conditional AMCEs for Democratic, independent, and Republican respondents, respectively.

[Figure 2.5 about here: estimated conditional AMCEs for the levels of all 10 candidate attributes, shown in separate panels for Democrats, Independents/Other, and Republicans; horizontal axis is the effect on probability of support, ranging from −0.2 to 0.2.]

Figure 2.5 Conditional average marginal component effects of candidate attributes across respondent party.

As we anticipated, the AMCEs for the policy position attributes are highly variable depending on whether a respondent is a Democrat or a Republican. For example, among Democrats, the probability of supporting a candidate increases by 19 percentage points on average when the position on healthcare changes from supporting Medicare to supporting government healthcare for all. There is no such effect among Republican respondents. There is a similar asymmetry for the effect of the position on immigration. Among Democrats, the probability of supporting a candidate increases by 18 percentage points on average when the position on immigration changes from supporting no pathway to citizenship for undocumented immigrants to supporting a pathway for all undocumented immigrants without a criminal record. Among Republicans, a similar change in the candidate’s immigration position leads to an 11 percentage point decrease in support. Respondents of different partisanship also exhibit preferences in line with their distinct electoral contexts. For example, while prior political experience of the candidate increases support among Democratic respondents on average compared to candidates with no experience in elected office, there is no such effect among Republican respondents.

In interpreting conditional AMCEs, researchers should keep in mind the same set of important nuances and common pitfalls as they do when analyzing AMCEs. That is, they represent an average effect of an attribute level against a particular baseline level of the same attribute, given a particular randomization distribution. In addition, researchers need to exercise caution when comparing one conditional AMCE against another. This is because the difference between two conditional AMCEs does not generally represent a causal effect of the conditioning respondent-level variable, unless the variable itself was also randomly assigned by the researcher. For example, in the Democratic primary experiment, the AMCE of a candidate supporting government healthcare for all as opposed to Medicare was 18 percentage points larger for Democrats than for Republican


respondents, but it would be incorrect to describe this difference as a causal effect of partisanship on respondents’ preferences for all public healthcare. This point is, of course, no different from the usual advice for interpreting heterogeneous causal effects (such as conditional ATEs) when subgroups are defined with respect to nonrandomized pretreatment covariates, though it is often overlooked in interpreting conditional AMCEs in conjoint applications (see Bansak 2020; Leeper et al. 2020).
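A minimal sketch of the conditional AMCE estimation for this example, continuing with the hypothetical data set used above and assuming a pretreatment respondent-level column named party; as emphasized above, differences between the subgroup estimates should not be read as causal effects of partisanship itself.

```python
import pandas as pd
import statsmodels.formula.api as smf

profiles = pd.read_csv("conjoint_profiles.csv")  # hypothetical data, as above

# Illustrative formula with one policy attribute; in practice all attributes enter as above.
formula = "chosen ~ C(healthcare, Treatment(reference='Medicare'))"

# Estimate the AMCEs separately within each partisan subgroup of respondents.
conditional_amces = {}
for party, subgroup in profiles.groupby("party"):
    fit = smf.ols(formula, data=subgroup).fit(
        cov_type="cluster", cov_kwds={"groups": subgroup["respondent_id"]}
    )
    conditional_amces[party] = fit.params

print(pd.DataFrame(conditional_amces))  # one column of conditional AMCE estimates per subgroup
```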

2.6 Applications of Conjoint Designs in Political Science

As discussed earlier in this chapter, a key factor behind the popularity of conjoint experiments in political science is their close substantive fit with key political science questions. Indeed, conjoint designs have been applied to understand how populations weigh attributes when making various multidimensional political choices, such as voting, assessing immigrants, choosing neighborhoods and housing (Hankinson 2018; Mummolo and Nall 2016), judging climate-related policies (Bechtel et al. 2019; Gampfer et al. 2014; Stokes and Warshaw 2017), publication decisions (Berinsky et al. 2019), and various other problems (Auerbach and Thachil 2018; Ballard-Rosa et al. 2017; Bechtel and Scheve 2013; Bernauer and Nguyen 2015; Gallego and Marx 2017; Hemker and Rink 2017; Sen 2017). In Table 2.2, we report the distribution of 124 recent conjoint applications published between 2014 and 2019 by their broad topical areas.6 A plurality of 27% of the applications involve voting and candidate choice. But conjoint designs have been deployed to understand how people collectively weigh different attributes in a wide range of other applications, from politically relevant judgments about individuals to choices among different policy bundles.

6 Specifically, we reviewed all published articles citing Hainmueller et al. (2014) and classified all that included a conjoint experiment.



Table 2.2 Topical classification of the 124 published articles using conjoint designs identified in our literature review for the years 2014–2019.

Topic                      Percentage (%)
Voting                     27
Public opinion             19
Public policy              6
Immigration                6
Government                 6
Climate change             6
Representation             5
International relations    5
Partisanship               4
Other                      17

In the rest of this section, we review several key areas of conjoint applications in more detail.

2.6.1 Voting

While some classic theoretical models examine political competition over a single dimension (Downs 1957), choosing between real-world candidates and parties almost always requires an assessment of trade-offs in aggregate. Conjoint designs are especially well suited to study how voters make those trade-offs. It is no surprise, then, that candidate and party choice is among the most common applications of conjoint designs (Abrajano et al. 2015; Aguilar et al. 2015; Carnes and Lupu 2016; Crowder-Meyer et al. 2020; Franchino and Zucchini 2015; Horiuchi et al. 2018; Kirkland and Coppock 2018; Teele et al. 2018). One especially common use of conjoint designs has been to examine biases against candidates who are from potentially disadvantaged categories, including women, African Americans, and working-class candidates. Crowder-Meyer et al. (2020), for example, demonstrate that biases against Black candidates increase when MTurk respondents are cognitively taxed. This study also illustrates another advantage of conjoint designs, which is that they permit the straightforward estimation of differences in the causal effects or AMCEs across other

randomly assigned variables. Those other variables can either be separate attributes within the conjoint or else randomized interventions external to the conjoint itself. An example of the former would be analyzing the difference in AMCEs across the levels of another randomized attribute, while an example of the latter would be analyzing the difference in AMCEs when the framing of the conjoint task itself varies. As designs of the latter type demonstrate, conjoint analyses can not only provide measures of preferences about multidimensional objects, but also be used to evaluate the effect of separate, randomized interventions (see especially Butler and Crabtree 2017; Dill and Schubiger 2019).

At the same time, conjoint designs can help explain observed biases even when uncovering no outright discrimination. Carnes and Lupu (2016) report conjoint experiments from Britain, the USA, and Argentina showing that voters do not penalize working-class candidates in aggregate, a result that suggests that the shortage of working-class politicians is driven by supply-side factors. Also, Teele et al. (2018) use conjoint designs to show that American voters and officials do not penalize – and may even collectively favor – female candidates. Yet they also prefer candidates with traditional family roles, setting up a “double bind” for female candidates.7

Conjoint designs can also be employed to gauge the associations between attributes and a category of interest (Bansak et al. 2019). For example, Goggin et al. (2019) use conjoint experiments embedded in the Cooperative Congressional Election Study to have respondents guess at candidates’ party or ideology using issue priorities and biographical information. They find that low-knowledge and high-knowledge voters alike are able to link issues with parties and ideology, providing grounds for guarded optimism about voters’ capacity to link parties with their issue positions.

7 Not only do conjoint designs have the potential to reduce social desirability biases in some instances, they also facilitate the study of heterogeneous treatment effects relative to alternative approaches such as list experiments.


Candidate traits, by contrast, do not provide sufficient information to allow most voters to distinguish the candidates’ partisanship. Conjoint designs have thus helped shed new light on long-standing questions of ideology and constraint.

Still other uses of conjoint designs can illuminate aspects of voter decision-making and political psychology. For example, ongoing research by Ryan and Ehlinger (2019) examines a vote choice setup in which candidates take positions on issues whose importance to the respondents had been identified in a previous wave of a panel survey. And separate research by Bakker et al. (2019) deploys conjoint methods to show that people low in the psychological trait “agreeableness” respond positively to candidates with anti-establishment messages. Conjoint designs can also shed light on how political parties choose which candidates to put before voters in the first place (Doherty et al. 2019). Researchers can also use conjoint designs to examine the interactions among different attributes as well as the effects of specific clusters of attributes.

2.6.2 Immigration Attitudes

Whether hiring, dating, or just striking up a conversation, people evaluate other people constantly. That may be one reason why conjoint designs evaluating choices about individuals have proven to be relatively straightforward – and often even engaging – for many respondents. Indeed, we commonly find that respondents seem to enjoy and engage with conjoint surveys, perhaps because of their novelty. In response to one of the experiments done for Bansak et al. (2019), a respondent wrote, “This survey was different than others I have taken. I enjoyed it and it was easy to understand.” An MTurk respondent wrote, “Thank you for the fun survey!” Such levels of engagement may help explain some of the robustness of conjoint experiments we detail above. Given how frequently people find themselves evaluating other people, it is


not surprising that conjoint experiments have been used extensively to evaluate immigration attitudes (Adida et al. 2017; Auer et al. 2019; Bansak et al. 2016; Clayton et al. 2019; Flores and Schachter 2018; Hainmueller and Hopkins 2015; Schachter 2016; Wright et al. 2016). Hainmueller and Hopkins (2015) find that American respondents recruited via a probability sample actually demonstrate surprising agreement on the core attributes that make immigrants to the USA more or less desirable. Wright et al. (2016) show that sizable fractions of American respondents choose not to admit either immigrant when they are presented in pairs and there is the option to reject both.

2.6.3 Policy Preferences

Conjoint experiments have also been employed to examine voters’ policy preferences. In these applications, respondents are often confronted with policy packages that vary on multiple dimensions. Such designs can be used to examine the trade-offs that voters might make between different dimensions of the policy and to examine the impacts of changing the composition of the package. For example, Ballard-Rosa et al. (2017) use a conjoint survey to examine American income tax preferences by presenting respondents with various alternative tax plans that vary the level of taxation across six income brackets. They find that voter opinions are not far from current tax policies, although support for taxing the rich is highly inelastic. Bansak et al. (2021) employ a conjoint experiment to examine mass support in European countries for national austerity packages that vary along multiple types of spending cuts and tax increases, allowing them to evaluate eligible voters’ relative sensitivities to different austerity measures as well as to estimate average levels of support for specific hypothetical packages. One feature of these studies is that the choice task (i.e., evaluating multidimensional policy packages) is presumably less familiar and more complex for respondents than the



task of evaluating people or political candidates. That said, many real-world policies involve precisely this type of multifeatured complexity, and the preferences of many voters vis-à-vis these policies might well be highly contingent. For example, respondents might support a Brexit plan only if it is based on a negotiated agreement with the European Union. Similarly, during its 2015 debt crisis, Greece conducted a bailout referendum in which voters were asked to decide whether the country should accept the bailout conditions proposed by the international lenders.

2.6.4 Challenges and Open Questions

Still, there are a range of outstanding questions about conjoint survey experiments. For example, a central challenge in designing conjoint experiments is the possibility of producing unrealistic profiles. Fully randomized conjoint designs have desirable features, but one limitation is that the independent randomization of attributes that are in reality highly correlated may produce profiles that seem highly atypical. To some extent, this is a feature rather than a bug: it is precisely by presenting respondents with atypical profiles that it is possible to disentangle the specific effects of each attribute. While in 2006 it might have seemed unlikely that the next US president would be the son of a White mother from Kansas and a Black father from Kenya and someone who spent time in Indonesia growing up, Barack Obama was inaugurated just a few years later.

In some instances, however, atypical or implausible profiles are genuine problems, which can be addressed through various approaches. For one thing, researchers can modify the incidence of different attributes to reduce the share of profiles that are atypical. They can also place restrictions on attribute combinations or can draw two seemingly separate attributes jointly. For example, if the researchers want to rule out the possibility of a candidate profile of a very liberal Republican, they can simply draw ideology and partisanship jointly from a set of options that excludes that combination.
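As an illustration of drawing two attributes jointly, the sketch below enumerates the allowed ideology–party pairs and samples from them directly, so that the excluded combination (a very liberal Republican) can never appear; the level labels and the uniform sampling probabilities are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

ideologies = ["Very liberal", "Somewhat liberal", "Moderate",
              "Somewhat conservative", "Very conservative"]
parties = ["Democrat", "Republican"]

# Enumerate the allowed (ideology, party) pairs, dropping the implausible combination.
allowed_pairs = [
    (ideology, party)
    for ideology in ideologies
    for party in parties
    if not (ideology == "Very liberal" and party == "Republican")
]

# Draw the two attributes jointly (here uniformly over the allowed pairs) rather than
# independently, so a profile can never combine "Very liberal" with "Republican".
def draw_ideology_and_party():
    return allowed_pairs[rng.integers(len(allowed_pairs))]

print([draw_ideology_and_party() for _ in range(5)])
```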

Finally, researchers can also identify profiles as atypical after the fact and then examine how the AMCEs vary between profiles that are more or less typical, as in Hainmueller and Hopkins (2015). In practice, the atypical profiles are randomly assigned, so researchers can straightforwardly compare AMCEs between profiles or tasks judged to be atypical and others.

There are also outstanding questions about external validity. To date, conjoint designs have been administered primarily via tables with written attribute values, even though information about political candidates or other choices is often processed through visual, aural, or other modes. Do voters, for example, evaluate written attributes presented in a table in the same way that they evaluate attributes presented in more realistic ways? The table-style presentation may prompt respondents to evaluate the choice in different ways, and so hamper external validity. It also has the potential to lead respondents to consider each attribute separately, rather than assessing the profile holistically.

One core benefit of conjoint designs can also be a liability in some instances. Conjoint designs return many possible quantities of interest, allowing researchers to compare the AMCEs for various effects and to test hypotheses competitively. However, this also opens up the possibility of multiple comparisons concerns, as researchers may conduct multiple statistical tests. This feature of conjoint designs makes preregistration and preanalysis plans especially valuable in this context. At the same time, conjoint experiments open up a wide range of new substantive and statistical questions about the interactions across different attributes, questions that researchers have only begun to probe.

References Abrajano, Marisa A., Christopher S. Elmendorf, and Kevin M. Quinn. 2015. “Using Experiments to Estimate Racially Polarized Voting.” UC Davis Legal Studies Research Paper Series, No. 419.

Conjoint Survey Experiments Abramson, Scott F., Korhan Koçak, and Asya Magazinnik. 2019. “What Do We Learn about Voter Preferences from Conjoint Experiments?” Working paper presented at PolMeth XXXVI. Acharya, Avidit, Matthew Blackwell, and Maya Sen. 2018. “Analyzing causal mechanisms in survey experiments.” Political Analysis 26(4): 357–378. Adamowicz, Wiktor, Peter Boxall, Michael Williams, and Jordan Louviere. 1998. “Stated preference approaches for measuring passive use values: choice experiments and contingent valuation.” American Journal of Agricultural Economics 80(1): 64–75. Adida, Claire L, Adeline Lo, and Melina Platas. 2017. “Engendering empathy, begetting backlash: American attitudes toward Syrian refugees.” Stanford–Zurich Immigration Policy Lab Working Paper No. 17-01. Aguilar, Rosario, Saul Cunow, and Scott Desposato. 2015. “Choice sets, gender, and candidate choice in Brazil.” Electoral Studies 39: 230–242. Auer, Daniel, Giuliano Bonoli, Flavia Fossati, and Fabienne Liechti. 2019. “The matching hierarchies model: evidence from a survey experiment on employers’ hiring intent regarding immigrant applicants.” International Migration Review 53(1): 90–121. Auerbach, Adam Michael, and Tariq Thachil. 2018. “How clients select brokers: Competition and choice in India’s slums.” American Political Science Review 112(4): 775–791. Bakker, Bert N., Gijs Schumacher, and Matthijs Rooduijn. 2019. “The Populist Appeal: Personality and Anti-establishment Communication.” Working paper, University of the Netherlands. Ballard-Rosa, Cameron, Lucy Martin, and Kenneth Scheve. 2017. “The structure of American income tax policy preferences.” The Journal of Politics 79(1): 1–16. Bansak, Kirk. 2020. “Estimating Causal Moderation Effects with Randomized Treatments and Non-Randomized Moderators.” Journal of the Royal Statistical Society: Series A. Forthcoming. Bansak, Kirk, Jens Hainmueller, Daniel J Hopkins, and Teppei Yamamoto. 2018. “The number of choice tasks and survey satisficing in conjoint experiments.” Political Analysis 26(1): 112–119. Bansak, Kirk, Jens Hainmueller, Daniel J. Hopkins, and Teppei Yamamoto. 2019. “Beyond the breaking point? Survey satisficing in conjoint experiments.” Political Science Research and Methods. URL: https://doi.org/10.1017/psrm .2019.13


Bansak, Kirk, Jens Hainmueller, Daniel J. Hopkins, and Teppei Yamamoto. 2020. “Using Conjoint Experiments to Analyze Elections: The Essential Role of the Average Marginal Component Effect (AMCE) (May 13, 2020).” URL: https://ssrn.com/abstract=3588941 Bansak, Kirk, Jens Hainmueller, and Dominik Hangartner. 2016. “How economic, humanitarian, and religious concerns shape European attitudes toward asylum seekers.” Science 354(6309): 217–222. Bansak, Kirk, Michael M. Bechtel, and Yotam Margalit. 2021. “Why Austerity? The Mass Politics of a Contested Policy.” American Political Science Review. Forthcoming. Bechtel, Michael M., Federica Genovese, and Kenneth F. Scheve. 2019. “Interests, norms, and support for the provision of global public goods: The case of climate cooperation.” British Journal of Political Science 49(4): 1333–1355. Bechtel, Michael M., and Kenneth F. Scheve. 2013. “Mass support for global climate agreements depends on institutional design.” Proceedings of the National Academy of Sciences of the United States of America 110(34): 13763–13768. Berinsky, Adam J., James N. Druckman, and Teppei Yamamoto. 2019. “Publication Biases in Replication Studies.” Working paper, Massachusetts Institute of Technology. Bernauer, Thomas, and Quynh Nguyen. 2015. “Free trade and/or environmental protection?” Global Environmental Politics 15(4): 105–129. Butler, Daniel M., and Charles Crabtree. 2017. “Moving beyond measurement: Adapting audit studies to test bias-reducing interventions.” Journal of Experimental Political Science 4(1): 57–67. Carnes, Nicholas, and Noam Lupu. 2016. “Do voters dislike working-class candidates? Voter biases and the descriptive underrepresentation of the working class.” American Political Science Review 110(4): 832–844. Clayton, Katherine, Jeremy Ferwerda, and Yusaku Horiuchi. 2019. “Exposure to immigration and admission preferences: Evidence from France.” Political Behavior. URL: https://doi .org/10.1007/s11109-019-09550-z Cox, David R. 1958. Planning of Experiments. New York: John Wiley. Crowder-Meyer, Melody, Shana Kushner Gadarian, Jessica Trounstine, and Kau Vue. 2020. “A different kind of disadvantage: Candidate race, cognitive complexity, and voter choice.” Political Behavior 42: 509–530.



Dafoe, Allan, Baobao Zhang, and Devin Caughey. 2018. “Information equivalence in survey experiments.” Political Analysis 26(4): 399–416. de la Cuesta, Brandon, Naoki Egami, and Kosuke Imai. 2019. “Improving the External Validity of Conjoint Analysis: The Essential Role of Profile Distribution.” Working paper presented at PolMeth XXXVI. Dill, Janina, and Livia I. Schubiger. 2019. “Attitudes towards the Use of Force: Instrumental Imperatives, Moral Principles, and International Law.” Working paper, Oxford University. Doherty, David, Conor M. Dowling, and Michael G. Miller. 2019. “Do local party chairs think women and minority candidates can win? Evidence from a conjoint experiment.” The Journal of Politics 81(4): 1282–1297. Downs, Anthony. 1957. An Economic Theory of Democracy. New York: Harper. Druckman, James N., Donald P. Green, James H. Kuklinski, and Arthur Lupia. 2011. Cambridge Handbook of Experimental Political Science. Cambridge, UK: Cambridge University Press. Egami, Naoki, and Kosuke Imai. 2019. “Causal interaction in factorial experiments: Application to conjoint analysis.” Journal of the American Statistical Association 114(526): 529–540. Flores, René D., and Ariela Schachter. 2018. “Who are the ‘Illegals’? The social construction of illegality in the United States.” American Sociological Review 83(5): 839–868. Franchino, Fabio, and Francesco Zucchini. 2015. “Voting in a multi-dimensional space: A conjoint analysis employing valence and ideology attributes of candidates.” Political Science Research and Methods 3(2): 221–241. Gallego, Aina, and Paul Marx. 2017. “Multidimensional preferences for labour market reforms: a conjoint experiment.” Journal of European Public Policy 24(7): 1027–1047. Gampfer, Robert, Thomas Bernauer, and Aya Kachi. 2014. “Obtaining public support for North-South climate funding: Evidence from conjoint experiments in donor countries.” Global Environmental Change 29: 118–126. Ganter, Flavien. 2019. “Revisiting Causal Inference in Forced-Choice Conjoint Experiments: Identifying Preferences Net of Compositional Effects.” URL: https://osf.io/preprints/ socarxiv/e638u/ (accessed December 18, 2019). Goggin, Stephen N., John A. Henderson, and Alexander G. Theodoridis. 2019.

“What goes with red and blue? Mapping partisan and ideological associations in the minds of voters.” Political Behavior. URL: https://doi.org/10.1007/S11109-018-09525-6 Green, Paul E., and Vithala R. Rao. 1971. “Conjoint measurement for quantifying judgmental data.” Journal of Marketing Research VIII 355–363. Hainmueller, Jens, and Daniel J. Hopkins. 2015. “The hidden american immigration consensus: A conjoint analysis of attitudes toward immigrants.” American Journal of Political Science 59(3): 529–548. Hainmueller, Jens, Daniel J. Hopkins, and Teppei Yamamoto. 2014. “Causal inference in conjoint analysis: Understanding multidimensional choices via stated preference experiments.” Political Analysis 22(1): 1–30. Hainmueller, Jens, Dominik Hangartner, and Teppei Yamamoto. 2015. “Validating vignette and conjoint survey experiments against realworld behavior.” Proceedings of the National Academy of Sciences of the United States of America 112(8): 2395–2400. Hankinson, Michael. 2018. “When do renters behave like homeowners? High rent, price anxiety, and NIMBYism.” American Political Science Review 112(3): 473–493. Hemker, Johannes, and Anselm Rink. 2017. “Multiple dimensions of bureaucratic discrimination: Evidence from German welfare offices.” American Journal of Political Science 61(4): 786– 803. Holland, Paul W. 1986. “Statistics and causal inference.” Journal of the American Statistical Association 81(396): 945–960. Horiuchi, Yusaku, Daniel M. Smith, and Teppei Yamamoto. 2018. “Measuring voters’ multidimensional policy preferences with conjoint analysis: Application to Japan’s 2014 election.” Political Analysis 26(2): 190–209. Horiuchi, Yusaku, Zach Markovich, and Teppei Yamamoto. 2019. “Does Conjoint Analysis Mitigate Social Desirability Bias?” Unpublished manuscript. Imbens, Guido W., and Donald B. Rubin. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge, UK: Cambridge University Press. Jasso, Guillermina, and Peter H. Rossi. 1977. “Distributive justice and earned income.” American Sociological Review 42(4): 639–651. Jenke, Libby, Kirk Bansak, Jens Hainmueller, and Dominik Hangartner. 2020. “Using Eye-Tracking to Understand Decision-Making

in Conjoint Experiments.” Political Analysis, 1–27. doi:10.1017/pan.2020.11. Khanna, Kabir. 2019. “What traits are Democrats prioritizing in 2020 candidates?” CBS News, May 8. URL: www.cbsnews.com/news/democratic-voters-hungry-for-women-andpeople-of-color-in-2020-nomination/ (accessed July 1, 2020). Kirkland, Patricia A., and Alexander Coppock. 2018. “Candidate choice without party labels.” Political Behavior 40(3): 571–591. Krosnick, Jon A. 1999. “Survey research.” Annual Review of Psychology 50(1): 537–567. Leeper, Thomas J., Sara B. Hobolt, and James Tilley. 2020. “Measuring subgroup preferences in conjoint experiments.” Political Analysis 28(2): 207–221. Loewen, Peter John, Daniel Rubenson, and Arthur Spirling. 2012. “Testing the power of arguments in referendums: A Bradley–Terry approach.” Electoral Studies 31(1): 212–221. Luce, R. Duncan, and John W. Tukey. 1964. “Simultaneous conjoint measurement: A new type of fundamental measurement.” Journal of Mathematical Psychology 1(1): 1–27. Mummolo, Jonathan, and Clayton Nall. 2016. “Why partisans do not sort: The constraints on political segregation.” Journal of Politics 79(1): 45–59. Raghavarao, Damaraju, James B. Wiley, and Pallavi Chitturi. 2011. Choice-Based Conjoint Analysis: Models and Designs. Boca Raton, FL: CRC Press. Ryan, Timothy J., and J. Andrew Ehlinger. 2019. “Issue Publics: Fresh Relevance for an Old


Concept.” Working paper presented at the Annual Meeting of the American Political Science Association, August 2019, Washington, DC. Schachter, Ariela. 2016. “From ‘different’ to ‘similar’: An experimental approach to understanding assimilation.” American Sociological Review 81(5): 981–1013. Sen, Maya. 2017. “How political signals affect public support for judicial nominations: Evidence from a conjoint experiment.” Political Research Quarterly 70(2): 374–393. Sniderman, Paul M., and Douglas B. Grob. 1996. “Innovations in experimental design in attitude surveys.” Annual Review of Sociology 22: 377–399. Stokes, Leah C., and Christopher Warshaw. 2017. “Renewable energy policy design and framing influence public support in the United States.” Nature Energy 2(8): 17107. Teele, Dawn Langan, Joshua Kalla, and Frances Rosenbluth. 2018. “The ties that double bind: Social roles and women’s underrepresentation in politics.” American Political Science Review 112(3): 525–541. Tversky, Amos. 1967. “A general theory of polynomial conjoint measurement.” Journal of Mathematical Psychology 4: 1–20. Wallander, Lisa. 2009. “25 years of factorial surveys in sociology: A review.” Social Science Research 38: 505–520. Wright, Matthew, Morris Levy, and Jack Citrin. 2016. “Public attitudes toward immigration policy across the legal/illegal divide: The role of categorical and attribute-based decision-making.” Political Behavior 38(1): 229–253.

CHAPTER 3

Audit Studies in Political Science∗

Daniel M. Butler and Charles Crabtree

Abstract

Audit studies typically involve researchers sending a message to or making a request of some sample in order to unobtrusively measure subjects’ behaviors. These studies are often conducted as a way of measuring bias or discrimination. We introduce readers to audit studies, describe their basic design features, and then provide advice on effectively implementing these studies. In particular, we provide several suggestions aimed at improving the internal, ecological, and external validity of audit study findings.

3.1 Introduction

Audit studies1 are part of a larger group of field experiments designed to measure, but

* We would like to thank James N. Druckman, Catherine Eckel, Donald P. Green, Matthew Nielsen, and Ariel White for their extremely helpful comments.

1 In the literature, researchers performing audit studies use different terms. Sometimes they use the term “field experiment” to emphasize the fact that the experiment is taking place in the field. This is fine since audit studies are a subset of field experiments. Similarly, the term “correspondence study” is used to emphasize the communication aspect of the particular audit study. This is fine because correspondence studies are a subset of audit studies – ones where a message or correspondence is sent (not all audit studies involve correspondence).


not necessarily change, the behavior of subjects in the field.2 The usefulness of these studies depends on whether they measure behavior in an unobtrusive manner (more on this later in the chapter). At their most basic design level, these studies instigate some interaction with the subjects being studied (a request for help, a visit for a job interview, etc.) and then measure how the subjects respond.

2 Non-audit studies that belong in this group include work on censorship that varies the content of social media posts (King et al. 2014), studies varying ad buy requests (Crabtree et al. 2019), and the formation of anonymous shell companies (Findley et al. 2015).


In audit studies, the researcher sends a communication, varies some aspect of the sender (e.g., their race) and/or the content (e.g., a service request versus a policy request), and then measures how the receiver responds. The purpose of these studies is primarily to measure how the sample responds to different aspects of the stimuli. In this regard, audit studies are behavioral variants of survey questions.

Audit studies grew in popularity as a means to study discrimination in housing and labor markets (see review in Quillian et al. 2017). The civil rights movement led to the passage of legislation barring discrimination, which was accompanied by an interest in measuring whether discrimination persisted (see discussion in Gaddis 2017). Many of these studies looked at whether racial minorities were treated worse than their White counterparts (Wienk et al. 1979). Because audit studies on racial discrimination have been conducted for many decades, researchers have been able to compare how the treatment of racial minorities has changed (or not changed) over time (e.g., Quillian et al. 2017). While audit studies continue to study the treatment of racial minorities relative to Whites, the approach has also been applied to understand discrimination against many other groups. It has been used to examine discrimination based on one’s gender (Ayres and Siegelman 1995; Butler 2014), age (Ahmed et al. 2012), sexual orientation (Drydakis 2014), religion (Adida et al. 2010; Pfaff et al. 2019), criminal record (Pager 2003), and other attributes (Gell-Redman et al. 2017; Rivera and Tilcsik 2016; Weichselbaumer 2016). Audit studies have also been widely used by governments as a way to test for discrimination. In the 1960s, the UK parliament created the Race Relations Board, which commissioned several studies, including audit studies, aimed at measuring levels of racial discrimination (Daniel 1968). The tests uncovered discrimination and led to the passage of more laws barring racial discrimination in housing and employment (Smith 2015). In the USA, the US Department of Housing and Urban Development (HUD) conducted several audit studies to measure levels of discrimination in the housing


market. In addition to several studies that focused on specific cities (e.g., Johnson et al. 1971; Pearce 1979), HUD commissioned several national audit studies (Turner and James 2015; Wienk et al. 1979; Yinger 1991, 1993). The federal government’s decision to use audit studies to measure discrimination influenced academics by signaling that these studies were reliable, effective ways of measuring discrimination (see discussion in Gaddis 2017).

The advantage of audit studies, and all field measurement studies, is that they provide researchers with a measure of how the subjects under study behave. Survey responses can be cheap talk, especially when the topic under investigation involves behavior that is socially unacceptable. Findley and Nielson (2016), for example, conducted follow-up surveys with some of the same companies that had been part of a field measurement study they had done looking at levels of compliance with international law. In the original study (Findley et al. 2015), the authors studied whether the individuals who provide incorporation services (i.e., helped with legal documents to create a company) for anonymous shell companies (i.e., companies that do not have employees but hold financial resources) are abiding by the international agreements related to the formation of these companies. To do so, they posed as consultants seeking to form anonymous shell companies and approached thousands of individuals who provide incorporation services, varying where the citizen making the offer came from and whether a premium was offered for confidentiality. They found that a large number of providers are willing to provide the service without the required identification documentation. In the follow-up survey, they asked respondents what documentation they would require if someone asked for help in creating an anonymous shell company. They then compared the survey responses to the behavior they observed in their field measurement study. The results of their study show that the survey results overstate the level of compliance with the law. There are two reasons why the survey results understated



the level of bad behavior. First, some of the worst offenders did not complete the survey. Second, those who responded to the survey self-reported higher levels of compliance with the law than what was observed in practice during the audit portion of the study (see also Doherty and Adler 2020; Pager and Quillian 2005).

Similar factors are likely to be a concern in any survey on discrimination. First, the people who are most discriminatory may be more likely to opt out because they know that their behavior is wrong or sense that it will be judged as wrong. If we try to draw conclusions based on the people who opt in, we might underestimate the level of bias. Audit studies can mitigate this issue by getting responses from a larger set of the sample of interest. Second, social desirability is probably an issue in these contexts. Discrimination – the focus of many audit studies – is the type of topic that is likely to suffer from social desirability effects. People may simply underreport their own discriminatory behaviors and attitudes in a survey. Audit studies avoid this potential pitfall by looking at individual behavior when people do not realize they are being studied (and thus cannot artificially change their behavior to look less biased to the researcher). The audit study, if well done, captures behavior in action, unaffected by social desirability bias.

3.2 Basics of an Audit Study

Most audit studies involve testing for discrimination in how a group responds to some type of request (an email seeking help, a job application, a housing application, etc.). Audit study designs generally involve the following steps:

• Identify the question, the population, and the sample.
• Develop the instrument(s).
• Randomly assign treatments.
• Send messages.
• Measure the outcome by looking at responses.

To illustrate these steps, consider studying whether bureaucratic offices exhibit racial discrimination against Blacks relative to Whites. Contacting a government office with a request for help is just one way that audit studies can be applied. Other applications, as mentioned, include applying for jobs or housing or trying to complete other important quotidian tasks (see review in Gaddis 2017). We use the example of contacting a bureaucratic office because this is representative of an approach commonly used in political science (also see Chapter 27 in this volume). Researchers could study this question by sending email requests for help to different bureaucratic offices and randomizing whether the request comes from someone who is putatively Black or White (e.g., using stereotypical names to signify race). The requests would be identical in all ways other than the race of the requester. The researchers can then measure how the offices respond to the requests to see if they are less (or more) responsive to requests from Blacks. By following these steps, researchers can measure levels of discrimination – if the response rates differ on average, then it suggests discriminatory behavior based on race. In the rest of this section, we highlight a few of the major decisions that go into conducting an audit study.

3.2.1 Be Precise about the Question and the Population and Sample

Audit studies are well suited for studying discrimination. We follow Pager and Shepherd (2008) and define discrimination as the difference between how two groups are treated. Discrimination, which involves behavior, is distinct from holding racist attitudes or beliefs (e.g., prejudice). All of these other factors can motivate behavior. Studying discrimination does not presume what is causing the unequal behavior (see discussion in Pager and Shepherd 2008), though a promising, necessary direction of future work would be to examine potential causes. Most audit studies, though not all (e.g., Butler et al. 2012), have focused on


measuring whether subjects are engaging in some form of discrimination. In our running example, the researchers are interested in testing whether bureaucratic offices are discriminating based on race (e.g., are less responsive to Blacks than Whites). An audit study is appropriate for this question because it is focused on the behavior of government workers.

It is also important to be precise about the population and sample. In many existing audit studies, researchers contact legislative offices (Butler and Broockman 2011; Gell-Redman et al. 2018). Even if the researchers use the legislator’s email address, these requests may be dealt with by staff. In other words, these studies are not necessarily about legislators, but rather about legislative offices. These studies speak to the behavior of legislative offices, which can tell us something about how legislators represent their constituents (Salisbury 1981a, 1981b). This is not to say that these studies are not informative; rather, it is important to be precise about who or what we are studying. A related issue is whether the researcher is interested in the behavior of all legislators in the USA or just a specific subset (e.g., state legislators). Often researchers are interested in the behavior of all legislators but choose only to include state legislators in their sample. In many cases, this will be appropriate because all legislators face similar electoral and party pressures (see discussion in Butler and Powell 2014). However, the researcher needs to evaluate this decision on a case-by-case basis: Is it appropriate to generalize from the sample to a larger population? The researcher should be clear about the population that they want to study and ensure that their sample is correct for the question of interest. In our running example, the researchers are interested in learning about how bureaucratic offices deal with requests.

3.2.2 Develop the Instrument (or Message) to Maximize the Likelihood That It Reflects a Commonly Encountered Communication

Audit studies are typically used to measure the level of discrimination that individuals


face. This is best done by creating an instrument (or message) that people are likely to send. “Instrument” refers to the message that the researcher is sending to the people being studied. In a study of job market discrimination, the instrument might be the resume used to apply for jobs. In our running example, the instrument would be the email message that the researchers send to the bureaucratic offices. It is crucial that the researchers develop the instrument so that the people in the sample being studied approach the communication as they normally would. If the people in the sample suspect that they are being studied, they may behave differently, leading the researcher to draw incorrect conclusions. This point is so important that we devote a full section below (Section 3.3.1) to discussing it.

Finally, note that an instrument can vary in more than one aspect. Bertrand and Mullainathan (2004), for example, vary both race and qualifications in their experiment looking at how race influences employers’ interest in job applicants. They find that the effect of race varies with the applicant’s qualifications (the bias is worse for higher levels of qualifications). Researchers with theoretical reasons to vary more than one aspect can do so using a factorial design.

3.2.3 Randomly Assign Treatments

One decision that researchers have to make is whether they will send just one message to each unit in the sample or multiple messages. Sometimes researchers will use a paired design, where they send each unit in the sample one message for each of the treatments. In our running example, a researcher using a paired design would send two (or more) messages to each bureaucratic office: one from a putatively Black individual and one from a putatively White individual. While this design can be appropriate and can increase statistical power, we generally recommend against it. As we mentioned above, audit studies are most effective when they unobtrusively measure respondents’ behavior. A paired design increases the



likelihood that the experiment may be discovered, which, as we discuss later in the chapter, undermines and can fatally compromise the usefulness of the study. In our running example, the researcher would randomly assign each office to receive a message from either a putatively Black individual or a putatively White individual. 3.2.4 Sending Messages That Hold Confounding Factors Constant Early audit studies involved having actors from different racial groups apply for jobs or housing (e.g., Pager 2003). The actors would apply in person and the researchers would see whether the racial minorities were treated differently. One concern about these studies is that the White and racial minority actors likely differ in systematic ways. If these differences are also relevant to the hiring or housing decision, then it is possible that these confounding factors might be responsible for the differential treatment. To avoid this potential criticism, researchers would identify people who were similar to begin with, and they would train auditors to respond in similar ways. The goal was to minimize any potential confounding characteristics. However, because it is nearly impossible to deal with all potential confounders, including the fact that the auditors (i.e., the actors) knew the treatment, skeptics raised concerns that any measures of bias were inaccurate (e.g., even if their resumes were virtually identical, their in-person behaviors likely differed) (Heckman 1998). In response to these criticisms, researchers have transitioned to sending messages by mail or email.3 This allows them to send messages that are identical except in ways that researchers intentionally manipulate. Returning to our running example, researchers might send email 3 This may not always be possible (e.g., in some poorer localities, this may not be how people communicate), or the transaction may need to be done in person. Researchers can still conduct audit studies in such cases, but they should do their best to minimize the possibility that other factors are confounded with the treatment.

messages to bureaucratic offices that are the same in every way except in the name of the sender, which signals the sender’s race, gender, or other ascriptive attributes. 3.2.5 Measuring Outcomes Audit studies have a relatively clear interpretation, which makes them an attractive tool for studying how officials treat individuals from various groups.4 Many of the audit studies in political science have been used to compare how public officials treat different groups of individuals (see the review in Costa 2017). The most common outcome is to look at whether a response is given to the sender’s request or application. People sending the messages generally want a response, and so researchers can look at whether they receive one. This is generally a good outcome to report when performing an audit study, as it is the most basic outcome and can be compared with previous audit studies. Researchers can also look at the quality of the response (the length of the response, the friendliness, whether the requested information was provided, etc.). Researchers who look at these outcomes must be careful to avoid the bias that comes from conditioning on whether the original message received a 4 An open question is the degree to which audit studies can speak to how well politicians substantively represent different groups of constituents. Political science is broadly interested in questions related to the equality of how groups are treated. Numerous studies on representation, for example, try to answer whether politicians give preference to one group over another. However, studying whether politicians represent a group’s opinion on the issues better than they represent another group’s opinion can be difficult (Matsusaka 2001). One of the attractive features of an audit study is that it is a straightforward way to measure whether the politicians are being equally responsive. However, responding to an email or letter is not the same as taking action on a bill. The degree to which audit studies can speak to broader inequalities in representation is an important avenue for future research. In the one study that looks at this question, Mendez and Grose (2018) find that being less responsive to requests from minority constituents correlates with a legislator being more likely to favor stricter voter identification laws. More research should be done to see whether audit studies can be used to study broader patterns of equality in representation. At the current time, the evidence is not sufficient to speak to this question.


response (Hemker and Rink 2017). Whether a response was given is a post-treatment variable because it comes after the individual being studied has been exposed to their assigned treatment. For example, it might be that officials who are pressed for time are willing to provide a quick response to an email from a White individual but not to one from a Black individual. Perhaps these quick responses would only be of medium quality, because the official was rushed, but they would still be better than no response. If, in this hypothetical case, the researcher limited the sample to people who responded, they could mistakenly conclude that the White individuals received lower-quality responses because these quick responses bring the average quality down. While this example is hypothetical, the more general point is that conditioning on a post-treatment variable can introduce serious bias in unknown directions and should thus be avoided (Montgomery et al. 2018). Coppock (2019) outlines three options for identifying unbiased estimates related to the content of the email. Here, we discuss the approach that we believe will be best for most applications: redefining the outcome to avoid conditioning on whether a response was received. Redefining the outcome is straightforward and results in a dependent variable that is easy to interpret. When taking this approach, the researcher would redefine the outcome to be an indicator variable that is coded as 1 if the response meets some criterion and 0 otherwise. Returning to our running example, it might be the case that the researcher has coded whether a response from the bureaucratic office answered the question. The redefined outcome could be: Does the office send a response that answers the question? It would be coded as 1 if the response answered the question and 0 if the response did not answer the question or no response was sent.
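To make the recoding concrete, here is a minimal sketch in Python with hypothetical file and variable names (they are not from any particular study); it contrasts the problematic comparison, which conditions on replying, with the redefined outcome described above.

```python
# Illustrative sketch only; column names are hypothetical.
import pandas as pd

df = pd.read_csv("audit_replies.csv")
# treatment: "black_alias" or "white_alias" (randomly assigned)
# replied: 1 if the office sent any reply, 0 otherwise
# answered_question: 1 if a reply answered the question, 0 if it did not,
#                    missing when no reply was sent

# Problematic comparison: conditions on the post-treatment variable `replied`.
conditioned = (df.loc[df["replied"] == 1]
                 .groupby("treatment")["answered_question"].mean())

# Redefined outcome: 1 only if the office replied AND the reply answered the
# question; offices that never replied are coded 0 rather than dropped.
df["answered"] = ((df["replied"] == 1) & (df["answered_question"] == 1)).astype(int)
redefined = df.groupby("treatment")["answered"].mean()

print("Conditioned on replying (risks post-treatment bias):\n", conditioned, "\n")
print("Redefined outcome (no conditioning on replying):\n", redefined)
```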

3.3 Maximizing Internal, Ecological, and External Validity Researchers should aim to maximize the ecological validity of their study, or the


extent to which it approximates real-world interactions. They can help achieve this goal by ensuring the realism of their instrument. Specifically, they should design an instrument that avoids raising study subjects’ suspicions. If the study population suspects that the message they receive is not typical, they might doubt the identity of the sender or the purpose of the communication. For instance, if a law enforcement agency received a request about becoming a police officer from an individual who describes themselves as “Black” three times in an email, the individual reading the email might consider this unusual behavior and suspect that the message was testing how they would reply to a putatively Black citizen. These doubts could change how they respond, potentially biasing the results away from the very thing the researchers want to learn from a study like this. For example, perhaps subjects normally respond more to Whites than to Blacks. However, if they suspect that researchers are studying their behavior, they might be more careful in responding to communications from Blacks, to the point where they respond more to Blacks than Whites. This could lead researchers to mistakenly conclude that Blacks receive better, not worse, treatment.5 Researchers should also ensure that their instrument is realistic in the ways that matter for what they want to learn from their audit studies. Typically, researchers conduct audit studies because they want to learn about how the average member of a group is treated in some interaction. If the study population suspects that the message that they receive is not typical, they might think that the individual sending it is also atypical in some way. This could cause them to treat the sender

5 Another way of thinking about this issue is in terms of social desirability bias. When people know they are being surveyed, they underreport discriminatory attitudes. If people know they are being audited, they will similarly adjust their behavior on those specific communications to appear less discriminatory. The whole advantage of an audit study is to avoid this type of social desirability bias. If subjects believe that the message they receive is not genuine, then this advantage is lost.



differently from the average member of the sender’s group. For example, a researcher pretending to be a parent might send emails to a sample of principals that are unusually impolite. The principals might infer from the language used in the emails that the putative parent is entitled, bossy, or possesses some other negative personality attribute(s). As a result, they might reply differently from how they would to a message that was more polite and deferential. 3.3.1 Include Typical Requests in the Instrument One aspect that affects the realism of the instrument is the type of request(s) included in the instrument. Before determining what request(s) to make of the study population, researchers should ensure that these request(s) are similar to the ones that their subjects usually receive. They can do this in several ways. One approach is to conduct qualitative interviews with members of the study population about the type of interactions that they have with the public (Terechshenko et al. 2019).6 Researchers conducting audit studies to examine how offices (legislative, bureaucratic, etc.) behave might also use requests that appear in the frequently asked questions sections of office websites. When the study population consists of public offices or officials but neither of these two approaches is possible, researchers might want to consider issuing Freedom of Information Act (FOIA) requests for all messages received by the study population.7 After receiving these text corpora, researchers could use machine learning tools to summarize them and to construct typical requests (Grimmer and Stewart 2013).

6 Researchers might then want to exclude the individuals they interview from their study, as they might be more likely to think that the instrument was sent from a researcher. 7 Researchers could also couple their FOIA requests with an audit study design. Lagunes and Pocasangre (2019) present a good example of how this could be done.
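Building on the FOIA suggestion above, the sketch below clusters a hypothetical corpus of requests with off-the-shelf text tools so that the most common request types can be read off and imitated in the instrument. It is one simple option among the many approaches surveyed by Grimmer and Stewart (2013), not a prescribed pipeline, and the file name and number of clusters are assumptions for illustration.

```python
# Illustrative sketch; "foia_messages.txt" (one message per line) is hypothetical.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

with open("foia_messages.txt", encoding="utf-8") as f:
    messages = [line.strip() for line in f if line.strip()]

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(messages)

n_types = 8  # candidate number of "request types"; tune by inspection
km = KMeans(n_clusters=n_types, n_init=10, random_state=0).fit(X)

terms = np.array(vectorizer.get_feature_names_out())
for c in range(n_types):
    top_terms = terms[km.cluster_centers_[c].argsort()[::-1][:10]]
    # The top-weighted words give a rough label for each common request type.
    print(f"Request type {c}: " + ", ".join(top_terms))
```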

3.3.2 Use Different Aliases In the majority of audit studies, researchers create a set of fictitious identities and use these to send messages. Sometimes researchers use these identities to detect discrimination; sometimes they use them to conceal the fact that the messages come from researchers. The names that researchers use with these identities should be carefully selected. This is because each name signals a number of things about the identity of the sender. In the interests of maximizing the believability of the messages, researchers should select names that are typical among members of the identity’s group. When researchers do this, they should check how the names they use are perceived, because perceptions of names vary across study populations based on subject race, education, and geography (e.g., Crabtree and Chykina 2018). Researchers have two options for dealing with this potential problem. One is that they can use names that have been tested by other researchers. For example, Hughes et al. (2019) provide in their appendix a list of the names that they used, along with the results of a survey they conducted about popular ethnic perceptions of these names. Researchers can also pretest their own names through platforms such as Amazon’s Mechanical Turk. In the context of our running example, one could do this by selecting a large bundle of names and asking Mechanical Turk workers (MTurkers) to assess the likelihood that the name belongs to a Black or White individual.8 The results from this exercise would allow researchers to select the names that are most strongly associated with Black or White individuals. Regardless of the approach researchers use to expand their battery of names, we suggest that they pick names that are similarly 8 One potentially useful source for names is birth certificate or US census data. These data sources indicate the prevalence of names among racial groups. One limitation of these data, though, is that they indicate only how common the names are among groups and not how common individuals perceive the names to be among groups. Because subject perceptions might not match objective reality, we care more about the former than the latter when selecting names.


perceived. Unfortunately, there are multiple ways to measure similarity, and there are no hard rules about what counts as “similar” enough. Researchers should rely on their own contextual, theoretical, substantive knowledge when making this choice and transparently report their decision rule in their description of the design.9 3.3.3 Use Multiple Requests and Names A potential issue for researchers is that one of their subjects might receive multiple messages or that their subjects might share received messages with each other. This is potentially problematic if those messages are identical in nearly all aspects. That might lead subjects to doubt the authenticity of the messages or discern the intentions of the researchers, effectively spoiling the experiment. One way in which researchers can potentially get around this issue is by randomly varying aspects of the instrument, such as the included requests and sender names. In our running example, we might send several different types of requests to bureaucratic offices and use several different names to signal Black and White identities. By doing this, we would make it less likely that the messages would seem related to each other, which decreases the chances of potential discovery. As above, we suggest that researchers leverage their understanding of the phenomena they are studying to determine how similar the requests need to be. There is a second compelling reason to use a range of requests and names. Researchers often want to claim that the results of their study are indicative of more general social phenomena. By using different requests and different names, researchers can ensure that their results are not specific to any one request or name. In doing so, they can help maximize the external validity of their study.

9 One additional consideration when using multiple names and requests is that this additional variation in the study design can reduce statistical power. In order to minimize this issue, researchers should keep names and requests as similar as possible.
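Returning to the name pretest described in Section 3.3.2, the sketch below shows one way such ratings might be summarized into a final set of aliases. The data file, column names, and the 80/20 and dispersion cutoffs are all hypothetical examples of a decision rule, which, as noted above, the researcher must justify and report.

```python
# Illustrative sketch; file and column names are hypothetical.
import pandas as pd

ratings = pd.read_csv("name_pretest.csv")
# name: the alias shown to the rater
# intended_race: "Black" or "White" (what the researcher wants the name to signal)
# pr_black: rater's guess (0-100) that the name belongs to a Black individual

summary = (ratings.groupby(["name", "intended_race"])["pr_black"]
                  .agg(mean_pr_black="mean", sd_pr_black="std", n_raters="count")
                  .reset_index())

# Names that strongly signal the intended race (example cutoffs)...
black_candidates = summary[(summary["intended_race"] == "Black") &
                           (summary["mean_pr_black"] >= 80)]
white_candidates = summary[(summary["intended_race"] == "White") &
                           (summary["mean_pr_black"] <= 20)]

# ...keeping names with the least rater disagreement, so the two sets are
# similarly (and consistently) perceived.
print(black_candidates.nsmallest(5, "sd_pr_black"))
print(white_candidates.nsmallest(5, "sd_pr_black"))
```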


3.3.4 Use Reasonable Email or Postal Services Once researchers have developed an instrument, they need to deliver it. To do that, researchers typically create email or mail addresses for each identity that they use in their study. To maximize the believability of their intervention, researchers should use addresses that do not raise subject suspicions. This can be done by using common email or mail services, such as gmail.com or a post office box. If researchers are using email to deliver their instrument, they should consider creating unremarkable email addresses. For example, if the name for one identity is “Jane Smith,” they might want to create the email address “[email protected].”10 In some cases, researchers might want to have their identities associated with a real or fictional organization. This might lead them to partner with organizations and use their email domains to increase the believability of their messages. 3.3.5 Check the Final Instrument Once researchers have a draft of the final instrument, they should perform two additional checks. First, they should ensure that their instrument is not the same as one used in a prior study. As a corollary, they should not borrow parts of their instrument from previously completed studies. This can cause significant problems for researchers. As an example, White et al. (2015) used a set of names and email addresses to determine whether local election officials exhibited bias against Latinos in September 2012. Approximately four years later, a researcher used the same set of identities in an unrelated project. Some local election officials detected the similarities and posted a notice on a public bulletin board that individuals 10 If the researchers include numbers in their email addresses, they should think carefully about using sequences that do not necessarily indicate a birthdate, area code, zip code, or some other attribute of the sender. This means that researchers might want to avoid using three- and five-digit sequences that might reveal sender place and two-, four-, or six-digit sequences that might indicate sender age.



should not respond to these emails, in effect contaminating the researcher’s study (Kovaleski 2019). Second, researchers should ask several individuals who are or have been part of the study population to read the instrument and provide input (Pfaff et al. 2019).11 In our running example, we could ask individuals who used to work at bureaucratic offices what they thought about our instrument. These exchanges between researchers and the subject population can help identify issues with the instrument or suggest new ways of improving it or the broader experimental design.
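The first of these checks can be partly automated. The sketch below, with hypothetical file names, made-up aliases, and an arbitrary similarity threshold, compares a draft instrument and alias list against materials from earlier studies before anything is fielded, in the spirit of avoiding the kind of reuse that caused problems in the example above.

```python
# Illustrative sketch; paths, alias lists, and the 0.6 threshold are hypothetical.
from difflib import SequenceMatcher
from pathlib import Path

draft = Path("draft_instrument.txt").read_text(encoding="utf-8").lower()
our_aliases = {"Jane Smith", "Tremayne Washington"}  # made-up examples
prior_aliases = {line.strip()
                 for line in Path("prior_aliases.txt").read_text(encoding="utf-8").splitlines()
                 if line.strip()}

# Flag heavy textual overlap with any previously used instrument.
for prior_file in Path("prior_instruments").glob("*.txt"):
    similarity = SequenceMatcher(None, draft,
                                 prior_file.read_text(encoding="utf-8").lower()).ratio()
    if similarity > 0.6:
        print(f"Draft is {similarity:.0%} similar to {prior_file.name}; consider rewriting.")

# Flag aliases that earlier studies have already used.
reused = our_aliases & prior_aliases
if reused:
    print("Aliases already used in prior studies:", sorted(reused))
```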

3.4 Additional Design Considerations 3.4.1 Internal Validity As we discuss above, the internal validity of any empirical claims made from audit studies depends on the subject pool not knowing that they are the participants in an experiment. Just as importantly, the internal validity of these claims also depends on the identities used by the auditors appearing identical (in both observed and unobserved ways) to participants, with the exception of whatever attributes researchers intentionally manipulate (Heckman 1998). When identities do not otherwise appear identical, then researchers cannot be sure that any discrimination that they measure is related to the characteristics that they manipulate or to some other characteristics that correlate with them and might vary across identities. Another way of thinking about this is that the results from audit studies depend on the excludability assumption that the manipulated characteristic drives differences in how subjects respond and not some other characteristic (Butler and Homola 2017; Gerber and Green 2012). This concern is part of the reason why most audit studies are conducted via correspondence now, rather than in person (Gaddis 2017; Neumark 2012). For example, by sending messages to subjects, researchers have more control over how they construct 11 These individuals should be excluded from the study.

identities, making it theoretically easier to create similar messenger profiles. In our running example, it would be more feasible to create two constituents who share all visible characteristics except for their race, as signaled via their name, than to find a White person who is interchangeable with a Black person in every other way except for their race. Even in this case, though, people might object that the excludability restriction does not hold, and that the names that researchers use might signal not only race, but other characteristics as well (Fryer and Levitt 2004). For example, one might object that a name like “Lakisha” not only indicates that the putative sender might be Black, but also that they come from a poorer or less educated family (Butler and Homola 2017; Gaddis 2017). Thankfully, there are a variety of ways in which we can empirically assess the extent to which identities might appear the same. One approach is to pretest the different identities with some survey population, such as MTurkers (Gaddis 2018). The idea here is to ask respondents a series of questions about each identity’s observed and unobserved characteristics.12 If the responses indicate that the only differences relate to the manipulated characteristics, then researchers can be more confident that they have not failed to set some attributes as constant. On the other hand, if the responses indicate that names are different across multiple dimensions – “Archibald” and “Jamal” likely signal both race and socioeconomic status – then researchers can adjust their lists of names accordingly. 3.4.2 Spam Concerns One potential concern with conducting audit studies via email is that messages might be automatically marked as Spam. This would mean that subjects might not receive their assigned treatments. This would potentially be very problematic if certain experimental

12 Researchers might also use a conjoint design (see Chapter 2 in this volume).


treatments or treatment combinations were more likely to be identified as Spam. The issue here is that this would decrease the probability that messages with those treatments would receive a reply, potentially leading researchers to believe that bias exists where it does not. To help guard against that possibility, researchers should test their messages using spam classifier software. This type of software estimates the probability that a message would be marked as Spam. The industry-leading tool for this is provided by Postmark and is available at https://spamcheck.postmarkapp.com. 3.4.3 Email Tracking Technology also allows researchers to increase the power of their study by tracking who opens the emails they send. Several services provide this tracking by embedding a small image in the email. This can increase power by allowing researchers to identify the people who were exposed to the treatment (Nickerson 2005). If the treatment is in the email, only those who open the email can see the treatment. This is analogous to get-out-the-vote (GOTV) studies where the treatment is delivered face-to-face. Only those people who are at home during the canvassing hear the message. GOTV studies have increased power by knocking on all doors (including those assigned to the control condition) in order to identify the type of people who are at home to hear the message (Nickerson 2008). Then they restrict the sample to those who open the door in both the treatment and control groups. A similar approach can be used in email studies by limiting the sample to those in each treatment group who actually opened the email. Researchers should be careful in determining whether the treatment appears only after emails are opened. For example, many audit studies indicate the race of the sender at the bottom of an email message. Sometimes, though, researchers use email addresses that also indicate the race of the sender. When this is the case, recipients would receive treatments before opening emails.
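A quick way to act on the spam concern raised in Section 3.4.2 is sketched below: it posts a raw draft message to Postmark's public SpamCheck service. The endpoint and payload shown reflect our understanding of that service's public interface at the time of writing; researchers should verify them against the current documentation before relying on the results, and the message itself is a made-up example.

```python
# Illustrative sketch; confirm the SpamCheck API's current interface before use.
import requests

raw_message = """From: Jane Smith <janesmith@example.com>
To: clerk@example.gov
Subject: Question about office hours

Hello, could you tell me when your office is open this week? Thank you.
"""

response = requests.post(
    "https://spamcheck.postmarkapp.com/filter",
    json={"email": raw_message, "options": "long"},
    timeout=30,
)
result = response.json()
print("Spam score:", result.get("score"))  # lower scores are less spam-like
print(result.get("report", ""))            # per-rule diagnostics, when returned
```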


3.5 Ethical and Other Considerations In addition to the design considerations we have discussed, authors should also consider the interests of the research subjects (also see the discussion in Chapter 27 in this volume). We should adjust the design when needed to minimize any harm to subjects. And if there is sufficient cause for concern (and no way to mitigate those concerns) we should not conduct the research. This is true even if the research receives approval from the institutional review board (IRB).13 Taking steps to mitigate potential concerns is in the interests of research subjects and our research community. The ethical considerations of all projects need to be evaluated on their own merits. However, there are some concerns that frequently arise with audit studies. First, research subjects may have to spend significant resources to respond. Even if this is not true for any one individual – some replies might take only a minute – it might still be true in the aggregate. Second, there is a concern about embarrassing study subjects by releasing their information.14 Third, there is a concern about the fact that researchers who conduct audit studies do not ask subjects for their consent. In trying to deal with these and other issues, our own advice is that researchers think carefully about the harm that they might cause their sample and adjust their study to minimize this harm to trivial levels or, ideally, eliminate it completely. For 13 As Driscoll (2015) points out, IRBs exist principally to minimize the legal liability of colleges and not to carefully evaluate the ethics of all studies. Since researchers know comparatively more about their subjects and their study design, they must actively police their own work. We suggest, though, that all researchers receive IRB approval before conducting their studies, not only because review boards might provide informative feedback, but also because researchers can be demoted or lose their jobs if they do not (Robertson 2015). 14 A necessary but not sufficient condition for protecting study subjects is to delete any personal identifiable information from data sets before publicly releasing them. In the context of most audit studies, this means deleting names, contact information, addresses (since they can frequently be connected to individuals), and individual replies.



instance, a researcher might want to conduct a study of election officials in advance of an election. The worst thing that might happen here is that the intervention causes election officials to have less time for registering voters, resulting in lower voter turnout. To minimize the potential harms, the researcher might conduct the study during a nonelection period.15 We can think about this rule of thumb as we discuss the three issues raised above. Researchers can also apply the same rule of thumb as other project-specific issues arise in their own research. A more general and also more complete discussion of experiments and ethics is provided in Chapter 7 in this volume. One common concern is that researchers are using public officials’ time. This concern is one reason why many audit studies include simple requests in their instruments. We think that researchers should continue this general practice and keep time-consuming requests to a minimum, even if this imposes constraints on what can be studied (also see Chapter 8 in this volume for a discussion of the ethics of using public officials’ time). Another thing we can do more often is to recruit real people to send the messages. Butler et al. (2012) conducted an experiment in which they asked students to write to their Members of Congress and their state legislator. Not only did the students write the letters themselves, they were also given the responses that the researchers received. It is true that these students would probably not have written their letters without encouragement from the researchers; however, we think that encouraging people to communicate with their elected officials is generally a good thing. One could easily imagine this approach being part of a homework assignment in a class. Indeed, this is one way in which we as researchers might implement this suggestion: by asking students in a class 15 It might be less interesting, of course, or less relevant theoretically to examine biases in election official behavior at nonelection times. This example highlights the trade-off that researchers must sometimes face in minimizing subject harm.

to participate by writing a letter in which something about the communication is randomized. While we think that researchers should use real people when possible, the reality is that researchers will not use this approach unless reviewers reward it. We should not simply accept an audit study because it uses real people (nor should we reject an audit study because it does not). Rather, we are advocating that using real people should be a positive consideration when evaluating a paper for publication. The possibility of embarrassing research subjects is one of the other concerns that is raised about most audit studies. We must guard against this. The goal of audit studies is not to embarrass specific people. The goal is to identify systematic problems so that they can be improved. Maintaining confidentiality is the single most important way in which we can mitigate this concern. This is a simple solution, and it is vital. Failing to maintain confidentiality could have negative effects not only for individual researchers, but also for the study of political elites, and perhaps the field as a whole. The third factor we raised above – consent – is the hardest issue. For all of the reasons we have laid out above, an audit study is most useful when people do not know they are being studied. If consent is required, the worst offenders could opt out and/or people might change their behavior to act better than they normally do when interacting with others. Requiring consent would ruin the usefulness of most audit studies. The concerns about not getting consent, along with other potential issues, have to be weighed alongside the benefits of the study. We think that if other concerns are sufficiently minimized and the articulated benefit is clear, then many audit studies are still worth pursuing. Identifying bias and studying ways to minimize it can provide important societal benefits. Because of these concerns, all researchers should ask whether they need to conduct an audit study to answer their question of interest. In some cases, researchers can answer their question by reanalyzing the


results of a previous audit study. If so, then the researchers should approach the question in that way. In practice, this can be done by researchers sharing their data in ways that maintain the research subjects’ confidentiality. Ideally, these deidentified data would be shared on the Dataverse or a GitHub repo, so that others might have access to them and be able to easily build on prior work. Such efforts might help reduce the number of audit studies that need to be completed, thereby minimizing the possibility that researchers spoil the commons by overusing the same samples. Given the important function that audit studies can serve in documenting discrimination and inspiring policy change, we think that all interested researchers should do what they can to ensure that these studies remain a feasible way of measuring behavior in the future.

3.6 Future Directions More broadly, we see a bright future for audit study work and the need for more research in several important areas, particularly in the study of discrimination. First, researchers should identify why discrimination occurs. The results from existing audit studies largely affirm the historical, qualitative, and quantitative literatures on the prevalence of discrimination and its targets. Researchers, though, have done relatively little to identify the mechanisms that drive it.16 Until these mechanisms are understood, we cannot devise interventions to mitigate it. Second, researchers should consider expanding the subjects of audit study work in political science. Most studies have focused on examining biases among elected officials. While there has been some work on identifying discrimination by bureaucratic agents (e.g., White et al. 2015), much more work remains to be done with this vital sample of public servants. Third, researchers should consider expanding the geographic focus of audit studies. The majority of studies in political science have focused on examining 16 See Pfaff et al. (2019) for an important exception.


discrimination in the USA. Work in other Western or democratic contexts is scarce, and studies in authoritarian states are virtually nonexistent. Fourth, scholars should start quantitatively summarizing past audit study work (e.g., Costa 2017). We think that by conducting work in these four key areas researchers can substantially improve our understanding of discrimination in politics.

References Adida, Claire L., David D. Laitin, and Marie-Anne Valfort. 2010. “Identifying barriers to Muslim integration in France.” Proceedings of the National Academy of Sciences of the United States of America 107(52): 22384–22390. Ahmed, Ali M., Lina Andersson, and Mats Hammarstedt. 2012. “Does age matter for employability? A field experiment on ageism in the Swedish labour market.” Applied Economics Letters 19(4): 403–406. Ayres, Ian, and Peter Siegelman. 1995. “Race and gender discrimination in bargaining for a new car.” American Economic Review 85(3): 304–321. Bertrand, Marianne, and Sendhil Mullainathan. 2004. “Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination.” American Economic Review 94(4): 991–1013. Butler, Daniel M. 2014. Representing the Advantaged: How Politicians Reinforce Inequality. New York: Cambridge University Press. Butler, Daniel M., Christopher Karpowitz, and Jeremy C. Pope. 2012. “A Field Experiment on Legislators’ Home Style: Service versus Policy.” Journal of Politics 74(2): 474–486. Butler, Daniel M., and David E. Broockman. 2011. “Do politicians racially discriminate against constituents? A field experiment on state legislators.” American Journal of Political Science 55(3): 463–477. Butler, Daniel M., and Eleanor Neff Powell. 2014. “Understanding the Party Brand: Experimental Evidence on the Role of Valence.” Journal of Politics 76(2): 492–505. Butler, Daniel M., and Jonathan Homola. 2017. “An empirical justification for the use of racially distinctive names to signal race in experiments.” Political Analysis 25(1): 122–130. Coppock, Alexander. 2019. “Avoiding Post-treatment Bias in Audit Experiments.” Journal of Experimental Political Science 6(1): 1–4.



Costa, Mia. 2017. “How responsive are political elites? A meta-analysis of experiments on public officials.” Journal of Experimental Political Science 4(3): 241–254. Crabtree, Charles, and Volha Chykina. 2018. “Last name selection in audit studies.” Sociological Science 5: 21–28. Crabtree, Charles, Christopher J. Fariss, and Holger L. Kern. 2019. “What Russian Private Media Censor: New Evidence from an Audit Study.” Working paper. Doherty, David, and E. Scott Adler. 2020. “Campaign mailers and intent to turnout: Do similar field and survey experiments yield the same conclusions?” Journal of Experimental Political Science 7(2): 150–155. Driscoll, Jesse. 2015. “Prison states and games of chicken.” In Ethics and Experiments. Abingdon: Routledge, pp. 95–110. Drydakis, Nick. 2014. “Sexual orientation discrimination in the Cypriot labour market. Distastes or uncertainty?” International Journal of Manpower 35(5): 720–744. Findley, Michael G., and Daniel L. Nielson. 2016. “Obligated to deceive? Aliases, confederates, and the common rule in international field experiments.” In Ethics and Experiments: Problems and Solutions for Social Scientists, ed. Scott Desposato. Routledge Experiments in Political Science Series. Abingdon: Routledge, pp. 134–149. Findley, Michael G., Daniel L. Nielson, and Jason Sharman. 2015. “Causes of non-compliance with international law: Evidence from a field experiment on financial transparency.” American Journal of Political Science 59(1): 146–161. Fryer Jr., Roland G., and Steven D. Levitt. 2004. “Understanding the black–white test score gap in the first two years of school.” Review of Economics and Statistics 86(2): 447–464. Gaddis, S. Michael. 2017. “How black are Lakisha and Jamal? Racial perceptions from names used in correspondence audit studies.” Sociological Science 4: 469–489. Gaddis, S. Michael. 2018. “An introduction to audit studies in the social sciences.” In Audit Studies: Behind the Scenes with Theory, Method, and Nuance, ed. S. M. Gaddis. New York: Springer, pp. 3–44. Gell-Redman, Micah, Neil Visalvanich, Charles Crabtree, and Christopher J. Fariss. 2018. “It’s all about race: How state legislators respond to immigrant constituents.” Political Research Quarterly 71(3): 517–531.

Grimmer, Justin, and Brandon M. Stewart. 2013. “Text as data: The promise and pitfalls of automatic content analysis methods for political texts.” Political Analysis 21(3): 267–297. Heckman, James J. 1998. “Detecting discrimination.” Journal of Economic Perspectives 12(2): 101–116. Hemker, Johannes, and Anselm Rink. 2017. “Multiple Dimensions of Bureaucratic Discrimination: Evidence from German Welfare Offices.” American Journal of Political Science 61(4): 786–803. Johnson, David A., Richard J. Porter, and Patricia L. Mateljan. 1971. “Racial discrimination in apartment rentals.” Journal of Applied Social Psychology 1(4): 364–377. King, Gary, Jennifer Pan, and Margaret E. Roberts. 2014. “Reverse-engineering censorship in China: Randomized experimentation and participant observation.” Science 345(6199): 1–10. Kovaleski, Tony. 2016. “FBI probes emails sent to county clerks across Colorado and 13 other states.” URL: TheDenverChannel.com (accessed April 2, 2019). Lagunes, Paul, and Oscar Pocasangre. 2019. “Dynamic transparency: An audit of Mexico’s Freedom of Information Act.” Public Administration 97(1): 162–176. Matsusaka, John G. 2001. “Problems with a methodology used to evaluate the voter initiative.” Journal of Politics 63(4): 1250–1256. Mendez, Matthew S., and Christian R. Grose. 2018. “Doubling Down: Inequality in Responsiveness and the Policy Preferences of Elected Officials.” Legislative Studies Quarterly 43(3): 457–491. Montgomery, Jacob M., Brendan Nyhan, and Michelle Torres 2018. “How conditioning on post-treatment variables can ruin your experiment and what to do about it.” American Journal of Political Science 62(3): 760–775. Neumark, David. 2012. “Detecting discrimination in audit and correspondence studies.” Journal of Human Resources 47(4): 1128–1157. Nickerson, David W. 2005. “Scalable protocols offer efficient design for field experiments.” Political Analysis 13(3): 233–252. Nickerson, David W. 2008. “Is voting contagious? Evidence from two field experiments.” American Political Science Review 102(1): 49–57. Pager, Devah. 2003. “The mark of a criminal Record.” American Journal of Sociology 108(5): 937–975.

Pager, Devah, and Hana Shepherd. 2008. “The sociology of discrimination: Racial discrimination in employment, housing, credit, and consumer markets.” Annual Review of Sociology 34: 181–209. Pager, Devah, and Lincoln Quillian. 2005. “Walking the talk? What employers say versus what they do.” American Sociological Review 70(3): 355–380. Pearce, Diana M. 1979. “Gatekeepers and Homeseekers: Institutional patterns in racial steering.” Social Problems 26(3): 325–342. Pfaff, Steve, Charles Crabtree, Holger L. Kern, and John B. Holbein. 2019. “Does Religious Bias Shape Access to Public Services? A Large-Scale Audit Experiment among Street-Level Bureaucrats.” Working paper. Putnam, Robert D. 1993. Making Democracy Work: Civic Traditions in Modern Italy. Princeton, NJ: Princeton University Press. Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. 2017. “Meta-analysis of field experiments shows no change in racial discrimination in hiring over time.” Proceedings of the National Academy of Sciences of the United States of America 114(41): 10870–10875. Rivera, Lauren A., and Andras Tilcsik. 2016. “Class advantage, commitment penalty: The gendered effect of social class signals in an elite labor market.” American Sociological Review 81(6): 1097–1131. Robertson, Joshua. February 2015. “University suppressed study into racism on buses and ‘victimised’ its co-author.” Guardian. URL: https://bit.ly/2Xz3zmt (accessed April 1, 2019).


Terechshenko, Zhanna, Charles Crabtree, Kristine Eck, and Christopher J. Fariss. 2019. “Evaluating the influence of international norms and sanctioning on state respect for rights: A field experiment with foreign embassies.” International Interactions 45(4): 720–735. Turner, Margery A., and Judson James. 2015. “Discrimination as an object of measurement.” Cityscape: A Journal of Policy Development Research 17(3): 3–14. Weichselbaumer, Doris. 2016. “Discrimination against female migrants wearing headscarves.” URL: https://ssrn.com/abstract=2842960 (accessed April 1, 2019). White, Ariel R., Noah L. Nathan, and Julie K. Faller 2015. “What do I need to vote? Bureaucratic discretion and discrimination by local election officials.” American Political Science Review 109(1): 129–142. Wienk, Ronald. E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. 1979. Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey. Washington, DC: Department of Housing and Urban Development, Office of Policy Development and Research. Yinger, John. 1991. “Acts of discrimination: Evidence from the 1989 housing discrimination study.” Journal of Housing Economics 1: 318–346. Yinger, John. 1993. “Access denied, access constrained: Results and implications of the 1989 housing discrimination study.” In Clear and Convincing Evidence: Measurement of Discrimination in America, eds. Michael E. Fix and Raymond J. Struyk. Washington, DC: The Urban Institute Press, pp. 69–112.

CHAPTER 4

Field Experiments with Survey Outcomes∗

Joshua L. Kalla, David E. Broockman, and Jasjeet S. Sekhon

Abstract Field experiments with survey outcomes are experiments where outcomes are measured by surveys but treatments are delivered by a separate mechanism in the real world, such as by mailers, door-to-door canvasses, phone calls, or online ads. Such experiments combine the realism of field experimentation with the ability to measure psychological and cognitive processes that play a key role in theories throughout the social sciences. However, common designs for such experiments are often prohibitively expensive and vulnerable to bias. In this chapter, we review how four methodological practices that are currently uncommon in such experiments can dramatically reduce costs and improve the accuracy of experimental results when at least two are used in combination: (1) online surveys recruited from a defined sampling frame, (2) with at least one baseline wave prior to treatment, (3) with multiple items combined into an index to measure outcomes, and (4) when possible, a placebo control for the purpose of identifying which subjects can be treated. We provide a general and extensible framework that allows researchers to determine the most efficient mix of these practices in diverse applications. We conclude by discussing limitations and potential extensions.

4.1 Field Experiments with Survey Outcomes: What and Why? Researchers of political psychology, intergroup prejudice, media effects, learning, * This chapter is adapted from the previously published paper, “The Design of Field Experiments with Survey Outcomes: A Framework for Selecting More Efficient, Robust, and Ethical Designs” (Broockman et al. 2017).


public health, and other topics frequently test how randomized stimuli affect outcomes measured in surveys. For example, survey experiments that measure the effects of randomized stimuli presented in a survey on individuals’ responses to questions in the same survey constitute a dominant paradigm in political science (Druckman et al. 2006; Mutz 2011; Sniderman and Grob 1996).


Survey experiments are typically defined as the “deliberate manipulation of the form or placement of items in a survey instrument, for the purposes of inferring how public opinion works in the real world” (Gaines et al. 2007, p. 4). For example, in a study on the persuasive effect of targeted campaign mail, Hersh and Schaffner (2013) showed 2596 respondents in the 2011 Cooperative Congressional Election Study a mock campaign mailer that randomly assigned the candidate’s political party and whether the candidate vowed to work on behalf of a particular group (e.g., “My pledge: To represent the interests of [Latinos vs. constituents] in Congress”). The survey then asked respondents how likely they would be to vote for the candidate. The researchers found that voters tend to prefer broad appeals over targeted ones. As Krupnikov and Findley (2018) note, survey experiments such as that of Hersh and Schaffner (2013) might be able to capture “real-world” behavior for two reasons. First, survey experiments can be completed in a participant’s natural environment, such as on their home computer. Second, survey experiments can be conducted on representative samples, producing generalizable findings. However, as Krupnikov and Findley (2018) are quick to point out, in survey experiments, participants recognize that they are part of a research project: there is no separation between the delivery of the treatment and the measurement of the outcome, potentially biasing the results due to demand effects. Furthermore, in an article comparing the results of survey experiments with contemporaneous natural experiments, Barabas and Jerit (2010) find that survey experiments tend to substantially overstate how a stimulus might affect voters in the “real world.” These researchers note that, in survey experiments, individuals pay more attention to the stimuli presented than they otherwise would in the “real world,” and that outside the confines of a survey experiment, individuals are often presented with many competing stimuli, diminishing the effect of any one stimulus. Finally, lab- and survey-based experiments also typically measure


outcomes immediately, whereas long-run effects are often more relevant. Some researchers have turned to field experiments to help address these potential limitations. However, field experiments traditionally rely on behavioral outcomes such as voter turnout, whereas answering many research questions requires survey outcomes. For example, many theories hinge on psychological constructs, such as affect or ideology, which are difficult to measure without surveys. This chapter discusses a hybrid approach that weds aspects of traditional field and survey experiments, field experiments with survey outcomes: experiments where outcomes are measured by surveys, but randomized stimuli are delivered by a separate mechanism in the real world, such as by mailers, door-to-door canvasses, phone calls, or online ads. In these field experiments, participants are expected to respond to the stimulus as they would outside of a field experiment; when an individual participating in a field experiment receives a campaign mailer, they should be unaware that this mailer is connected in any way to a broader research project. Thus, they should read and react to that campaign mailer as if it were any other mailer. For example, Blattman et al. (2019) randomly assigned whether or not villages in Uganda received an anti-vote-buying campaign; among other outcomes, the authors conducted a survey to measure the campaign’s effects on reported vote-selling behavior and attitudes towards vote-buying. Gerber et al. (2009) randomly assigned some households to receive free subscriptions to newspapers and conducted follow-up telephone surveys of those sent and not sent the papers, querying political attitudes and knowledge. In these experiments, surveys critically facilitate the measurement of political attitudes and knowledge, but the “real-world” nature of the treatments they study also arguably makes their inferences more credible. Despite the potential utility of field experiments with survey outcomes for many research questions, they are nevertheless



somewhat rare in political science, and scholars face several barriers to implementing them. Most importantly, the common designs for such experiments can be prohibitively expensive. For example, an experiment that is powered to detect a “small” treatment effect of 0.1 standard deviations on a survey outcome could cost under $500 as a survey experiment using Mechanical Turk, but could easily cost over $1,000,000 as a field experiment using designs that are common today (see Section 4.2). In addition, the results of such experiments are vulnerable to bias from differential attrition, which occurs when treatments influence survey completion. This has been shown to occur and produce meaningfully large bias (Bailey et al. 2016), yet is often undetectable with common designs. The goal of this chapter is to help researchers conduct field experiments with survey outcomes that are significantly less expensive, more precise, and more robust. The chapter primarily focuses on reviewing design practices in such experiments that can have dramatic consequences for costs, precision, and robustness (Broockman et al. 2017). We also provide running examples to help illustrate how these practices can be implemented and with what consequences. While many of the running examples we use in describing these ideas stem from American politics, we provide a general framework that can be applied across settings to help guide the choice of designs and provide examples of how this framework can be applied cross-nationally. We also provide software that researchers can use to help design their own field experiments with survey outcomes. The main focus of this chapter is to describe and analytically decompose complementarities between these four practices. These practices are: (1) surveys administered online to a sample recruited from an ex ante defined sampling frame (e.g., Barber et al. 2014), (2) with at least one baseline wave prior to treatment (Iyengar and Vavreck 2012), (3) with multiple measures of outcomes gathered and combined into an index at each wave (Cronbach 1951), and, if possible, (4) a placebo wherein control subjects are

contacted with an unrelated appeal for the purpose of measuring compliance (i.e., to identify which subjects could have been treated) (Nickerson 2005). The complementarities between these four practices can yield extremely large gains. These practices are not novel on their own. Moreover, in common cases, when used alone each one does not increase efficiency considerably or at all. However, these practices interact in a nonadditive way such that employing at least two in combination can dramatically relax the constraints typically associated with field experiments with survey outcomes; in some examples, they decrease variable costs1 by 98%. Figure 4.1 previews some of our results about how these practices can interact in common settings. Figure 4.1 considers the variable costs of conducting a study in a common setting in the literature, a field experiment studying the persuasive effect of door-to-door canvassing of registered voters in the USA that measures outcomes in two rounds of post-treatment surveys to measure both short-run and long-run effects. Each row in Figure 4.1 corresponds to a different possible design: all 16 permutations of using or not using each of the four practices we discuss. The length of each bar corresponds to the cost of each possible design for achieving a fixed level of precision (a standard error of 0.045 standard deviations), assuming empirical parameters about survey costs, canvassing contact rate (i.e., compliance rate), and so forth estimated from two empirical studies. (These parameters are examples only. We describe how we calculated them from our empirical studies and the literature in the online appendix of Broockman et al. (2017).) The bar labeled “Traditional Design” shows the variable costs of a traditional experiment employing the modal design in the literature, which employs none of the four practices we study and relies on a telephone survey instead of an online survey to collect outcomes. Finally, the bar labeled “All Four Practices” 1 Throughout, we consider the variable costs of experiments only, not fixed costs such as the costs of pretesting a survey instrument, purchasing data on voters, etc.

[Figure 4.1 Comparing costs of different designs. Each horizontal bar gives the variable cost (x-axis) of one of the 16 possible designs, labeled by whether it uses a pre-treatment wave, a placebo, multiple measures, and a phone or online survey mode; the “Traditional Design” and “All Four Practices” designs are marked. Notes: This figure uses the framework we developed to estimate the feasibility of multiple potential experimental designs for an example door-to-door canvassing study that assumes the empirical parameters described in the online appendix of Broockman et al. (2017). Each bar corresponds with the cost of one potential experimental design. The label for the bar denotes whether each of the practices we discuss in this chapter is used.]

shows the variable cost of an experiment using all four practices. Figure 4.1 shows that, in this common setting, an experiment using all four practices can be significantly less expensive than an experiment using only one of these practices. An experiment with a variable cost of over $1,000,000 with none of these practices could instead cost approximately $20,000. In addition, such an experiment would be able to precisely test additional design assumptions and require real-world intervention on only a minuscule scale. Of course, Figure 4.1’s empirical results about the benefits of these four practices are specific to a particular intervention, population, and context; below, we provide general formulas that can be used to identify the costs of alternative designs in different settings. Accordingly, this chapter also reviews a general and extensible framework that allows researchers to select the most efficient mix of these practices in a wide variety

of applications and that can be easily extended to accommodate unique features of particular settings. This framework analytically captures the effect of parameters such as survey response rates, treatment application rates (e.g., the proportion of voters randomized to receive a canvassing intervention who answer their doors and can be treated), and the stability of survey responses on the cost of field experiments with survey outcomes that do or do not employ each of the four practices we consider. This framework also captures the gains in efficiency that can arise from the complementarities between the four practices we study. We provide several examples of how researchers can use this framework to select more efficient and robust designs in a wide variety of applications, just as Figure 4.1 did for the US door-to-door canvassing study. We conclude by discussing the remaining limitations and potential extensions of field experiments with survey outcomes.
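To convey the flavor of such a framework, the stylized calculator below is our own toy illustration, not the chapter's actual formulas, parameter values, or software. It enumerates design combinations and shows why the practices complement one another: noncompliance dilutes the detectable effect, a baseline wave absorbs outcome variance, a multi-item index measures the outcome more reliably, and a placebo lets the researcher survey only subjects who could be treated. All parameter values are made up, and survey nonresponse and many other real considerations are deliberately ignored.

```python
# Toy illustration only; parameters and simplifications are ours, not the chapter's.
from itertools import product
from statistics import NormalDist

z = NormalDist().inv_cdf

def surveys_needed(delta, alpha=0.05, power=0.80):
    """Completed surveys per wave (two equal arms) to detect standardized effect delta."""
    return 4 * (z(1 - alpha / 2) + z(power)) ** 2 / delta ** 2

tau = 0.10                        # effect among treated subjects, in outcome SDs
contact_rate = 0.20               # share of attempted contacts that succeed
rho = 0.70                        # pre/post correlation of the outcome (baseline wave)
rel_one, rel_index = 0.5, 0.8     # reliability of a single item vs. a multi-item index
cost_contact, cost_survey = 3.0, 5.0

for baseline, placebo, index in product([False, True], repeat=3):
    delta = tau * (rel_index if index else rel_one) ** 0.5
    if not placebo:
        delta *= contact_rate                 # ITT dilution from failure to treat
    n = surveys_needed(delta)
    if baseline:
        n *= (1 - rho ** 2)                   # covariate adjustment with the baseline wave
    attempts = n / contact_rate if placebo else n / 2
    waves = 2 if baseline else 1
    cost = attempts * cost_contact + n * waves * cost_survey
    print(f"baseline={baseline!s:5} placebo={placebo!s:5} index={index!s:5}"
          f"  surveys per wave={n:10,.0f}  cost=${cost:12,.0f}")
```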

[Figure 4.2 “The traditional design.” Flow diagram: Sample Defined → Random Assignment of N Subjects → Treatment Group (Treatment Delivery Attempted; Treatment Successfully Delivered for some subjects) and Control Group (Treatment Could Have Been Delivered, but this is unobservable) → Outcomes Measured, One Key Survey Item (Ostensibly Unrelated Telephone Survey).]

4.2 Field Experiments with Survey Outcomes: Typical Designs and Their Challenges

How do political science researchers typically conduct field experiments with survey outcomes today? In Broockman et al. (2017), we describe the existing, publicly available political science field experiments with survey outcomes of which we are aware (Broockman et al. 2017, table 1). There, we show that the existing experiments rarely take advantage of the complementarities between the four practices named above. Of the 14 existing, publicly available political science field experiments with survey outcomes in which a placebo was possible (because compliance could be observed due to the nature of the treatment),2 only one study (Broockman and Kalla 2016) was conducted using all of the practices described in this chapter.3 Similarly, of the 19 existing studies in which a placebo was not possible, none use all three possible practices.

2 For example, studies with door-to-door canvassing interventions can measure compliance by recording whether subjects appear at their doors when canvassers knock. However, studies of mail-based interventions may not be able to accurately record whether individuals open their mail.

3 Since the publication of Broockman et al. (2017), Kalla and Broockman (2018) report nine additional experiments using all four practices.

4.2.1 Common Practices and Challenges in Field Experiments with Survey Outcomes

To build familiarity with common existing practice, Figure 4.2 depicts the modal design of the field experiments from the existing literature. The modal design does not employ any of the four practices we study. We will call this "the traditional design." An analyst first defines a sample of individuals and randomly assigns them to treatment and control groups (moving from the first to the second step in Figure 4.2). Delivery of the treatment is attempted with treatment group subjects (the third step in Figure 4.2), but many treatment group subjects are not successfully treated (thus noncompliance is present). Control group subjects are not contacted. Finally, as shown in the last step in Figure 4.2, all subjects originally assigned to either condition are then solicited for an ostensibly unrelated follow-up survey, which few answer, that contains one key survey item of interest.

In this section, we review the challenges that field experiments with traditional designs often face. The following section will formalize how the methodological practices we describe can ameliorate each, especially when used in combination. To help illustrate key ideas, throughout we assume several example values for marginal costs of surveys, treatment, etc. However, we caution readers that these example values are for exposition purposes only and likely vary across contexts and time.

4.2.1.1 Noncompliance (Failure to Treat)

Failure to treat arises when some treatment group subjects are not successfully administered treatment. It increases necessary sample sizes (Gerber and Green 2012). To appreciate how, imagine planning an experiment to assess the impact of a door-to-door canvassing treatment powered to detect a 5 percentage point effect. Further suppose canvassers contact 20% of treatment group subjects (as in Bailey et al. 2016). A 5 percentage point effect among those contacted would manifest as an overall difference of 5 × 0.20 = 1 percentage point between the entire treatment and control groups. A final sample of approximately 80,000 survey responses would be required to detect this 1 percentage point effect with 80% power. That is, the last step in Figure 4.2 would need 80,000 observations.

The budgetary implications of failure to treat are especially unfavorable in field experiments with survey outcomes because failure to treat increases both the number of subjects one must treat and the number of subjects one must survey. Consider the example just discussed hoping to yield 80,000 survey responses for analysis. Assuming for the moment that survey response rates are 100%, the experimenter must pay to knock on the doors of the 40,000 subjects in the treatment group and to survey all 80,000 subjects. At marginal costs of $3 per canvass attempt and $5 per survey response, the experiment's variable cost would be $520,000, of which $400,000 is survey costs. However, if all subjects in the treatment group could actually be treated, only 3200 subjects would be necessary, resulting in variable costs of only $20,800, with only $16,000 in survey costs.
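To make this arithmetic concrete, the minimal sketch below reproduces figures of the same order. The chapter does not state its exact power-analysis assumptions (baseline outcome rate, one- versus two-sided test), so the specific call here is our illustrative assumption rather than the authors' calculation.

```python
from statistics import NormalDist


def total_n_two_proportions(p_control, effect, alpha=0.05, power=0.80):
    """Total completed interviews (both arms combined) needed to detect `effect`
    with a standard two-sided, two-sample test of proportions."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    p_treat = p_control + effect
    var = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return 2 * z ** 2 * var / effect ** 2


# Diluted 1-point effect (20% contact rate) vs. an undiluted 5-point effect,
# assuming (our assumption) a 50% baseline rate:
n_diluted = total_n_two_proportions(0.50, 0.01)    # ~78,000; the text rounds to ~80,000
n_undiluted = total_n_two_proportions(0.50, 0.05)  # ~3,100; the text rounds to ~3,200

# Variable costs at $3 per canvass attempt and $5 per completed survey,
# still assuming a 100% survey response rate (as in the text):
cost_diluted = (n_diluted / 2) * 3 + n_diluted * 5        # ~$510,000 (text: $520,000, with n = 80,000)
cost_undiluted = (n_undiluted / 2) * 3 + n_undiluted * 5  # ~$20,300 (text: $20,800, with n = 3,200)
print(round(n_diluted), round(n_undiluted), round(cost_diluted), round(cost_undiluted))
```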

4.2.1.2 Survey Nonresponse

Field experiments with survey outcomes usually collect outcomes by telephone, and response rates to telephone surveys in the USA and other developed countries are now typically under 10% (Kohut et al. 2012; Leeper 2019). In anticipation of this nonresponse, analysts must treat many more subjects, increasing treatment costs. To see how, consider the experiment described above. Recall that 80,000 post-treatment responses were needed in total across subjects assigned to treatment and control; with a survey response rate of 100%, this meant that 40,000 individuals would need to be assigned to the treatment group and attempted for treatment. However, anticipating a response rate of 10% to a final survey, an analyst must attempt to canvass 400,000 voters in order to yield 40,000 voters both attempted for canvassing and then successfully surveyed. Assuming marginal treatment costs scale linearly, this would increase treatment variable costs from $120,000 to $1,200,000 (in addition to the $400,000 in survey costs already discussed).
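A short illustrative calculation (our sketch of the arithmetic in this paragraph) shows how a 10% response rate scales up the number of canvass attempts and the treatment budget:

```python
# Running example: 40,000 treatment-group members must be both canvassed and surveyed.
responses_needed_in_treatment_group = 40_000
survey_response_rate = 0.10

canvass_attempts = responses_needed_in_treatment_group / survey_response_rate  # 400,000 attempts
treatment_cost = canvass_attempts * 3   # $1,200,000 at $3 per attempt (vs. $120,000 at 100% response)
print(int(canvass_attempts), int(treatment_cost))
```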

4.2.1.3 Limited Pretreatment Covariates Available

Finally, many field experiments with survey outcomes have few pretreatment covariates available that predict outcomes well. For example, Bailey et al. (2016) find that commercial scores and administrative data could only predict survey responses to a presidential vote choice question with an R² of 0.005. Such limited predictive power has several disadvantages. First, although baseline covariates can increase the precision of estimates (e.g., Sävje et al. 2016), covariates that predict outcomes poorly do not meaningfully do so. For example, when R² = 0.005, the sample size necessary to achieve the same precision decreases by only 0.5%. In addition, lacking prognostic covariates makes differential attrition difficult to detect. Differential attrition arises when the treatment influences survey response rates,4 jeopardizing the comparability of the surveyed treatment and control groups (Gerber and Green 2012, ch. 7). Any experiment with survey outcomes without prognostic pretreatment covariates cannot persuasively evaluate the assumption of no differential attrition, even though this assumption has been found to fail (Bailey et al. 2016). Finally, the absence of pretreatment covariates also precludes testing any theoretical predictions about how treatment effects are moderated by prior attitudes or previous exposure (e.g., Druckman and Leeper 2012).

4 For example, suppose pro-Clinton phone calls discourage Trump supporters from answering surveys later.
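As a rough guide (our gloss using the standard covariate-adjustment approximation, not a formula stated in the chapter), the sample size needed after adjusting for baseline covariates scales with the unexplained outcome variance:

$$n_{\text{adjusted}} \approx (1 - R^2)\, n_{\text{unadjusted}}, \qquad \text{so } R^2 = 0.005 \text{ implies a reduction of only } 0.5\%.$$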

4.3 How the Four Practices Can Reduce Costs and Increase Robustness

The practices we study are able to substantially ameliorate many of these challenges. In this section, we provide a formal analysis comparing the asymptotic efficiency of experiments that employ some or all of the practices we consider to the traditional design shown in Figure 4.2. We first describe and consider the trade-offs each of these practices involves and how each practice complements the others. We then use these analyses to build a framework for evaluating trade-offs between possible designs using different mixes of these four practices.

Our framework can accommodate a wide variety of possible settings, and all four of the practices we study will not be optimal in all of these settings. However, to build understanding about how each of these practices logistically functions, we begin by describing a possible study using all four practices in the setting of a door-to-door canvassing experiment targeting US registered voters' attitudes, as in Broockman and Kalla (2016):

1. Baseline survey recruitment. First, a researcher sends mail to a sampling frame of registered voters inviting them to complete a baseline online survey with multiple measures of outcomes. The survey collects respondents' email addresses so that they can be invited to follow-up surveys later. When subjects are asked to provide consent to this survey, it is important that they not be told that they may receive a treatment or placebo intervention. Alerting subjects to the connection between the intervention and survey measurement may produce biased treatment effect estimates through demand effects or social desirability, undermining a key motivation for conducting a field- instead of survey-based experiment.

2. Treatment or placebo delivery. Next, treatment is delivered to baseline survey respondents only, as is a placebo if possible. Only respondents to the baseline survey are targeted with a real-world intervention ostensibly unrelated to the survey. For example, a canvasser may visit baseline survey respondents' homes and deliver either the treatment or placebo. In Broockman and Kalla (2016), canvassers were sent to the same addresses from the voter registry used to target the mail, and the treatment was a conversation that encouraged taking the perspective of transgender people.
• A placebo is an unrelated intervention that is administered for the purpose of measuring compliance – that is, whether the treatment could have been delivered. For example, in the door-to-door canvassing reported in Broockman and Kalla (2016), canvassers attempted to have a conversation about recycling with voters in the placebo group for the purpose of measuring whether the perspective-taking treatment could have been delivered to them. We use the terms "placebo group" and "control group" interchangeably when describing designs that include placebos.5

3. Follow-up survey recruitment. Finally, the researcher conducts a follow-up survey. If a placebo is used, this survey would only target individuals who were contacted; if no placebo is used, all individuals randomly assigned would be targeted for a follow-up survey. For example, in Broockman and Kalla (2016), respondents are invited via email to complete these follow-up surveys.

Before we consider each of the four practices in detail, this example illustrates how they can function together.

5 Nickerson (2005) introduced the placebo design and summarizes it this way: "Rather than rely upon a control group that receives no attempted treatment, the group receiving the placebo can serve as the baseline for comparison for the treatment group … assuming that (1) the two treatments have identical compliance profiles; (2) the placebo does not affect the dependent variable; and (3) the same type of person drops out of the experiment for the two groups." The assumptions Nickerson (2005) names are critical, and satisfying them precludes the use of strategies such as asking subjects for their compliance status. For example, asking subjects whether they recalled receiving campaign or placebo mail would likely not satisfy Nickerson's (2005) first assumption, since different subjects are likely to recall receiving campaign and placebo mail.

4.3.1 Setup for Formal Analysis

In this subsection, we detail the assumptions and estimators that form the basis of our formal analysis of the four practices we study. Readers familiar with the design and analysis of experiments with noncompliance may wish to skip this subsection.


We assume a random sample of size N from an infinite population. Let $z_i \in \{0, 1\}$ denote the treatment randomly assigned to subject i, and let $d_i(z) \in \{0, 1\}$ indicate whether subject i is actually treated when the treatment assignment $z_i = z$. Let $Y_i(z, d)$ denote the potential outcome for subject i when $z_i = z$ and $d_i = d$. We assume the usual noninterference assumption, so the potential outcome of i only depends on the treatment subject i is assigned and receives. We also make the usual exclusion assumption, $Y_i(z, d) = Y_i(d)$, such that there is no way for the random assignment to affect the outcome except by influencing receipt of treatment. We define $Y_i(z = 1) = Y_i(z = 1, d = d_i(1))$. Compliers are those subjects who take treatment when they are assigned to the treatment group and do not take treatment when they are assigned to the control group (i.e., subjects for whom $d_i(1) = 1$ and $d_i(0) = 0$). We assume no subjects assigned to control are treated, such that $d_i(0) = 0$ for all i. Our estimand of interest is the complier average causal effect (CACE) (Gerber and Green 2012, p. 142), defined as:

$$\mathrm{CACE} = E[Y_i(d = 1) - Y_i(d = 0) \mid d_i(0) = 0, d_i(1) = 1]. \tag{4.1}$$

An alternative estimand, which ignores compliance, is the intent-to-treat estimand, defined as:

$$\mathrm{ITT} = E[Y_i(z = 1) - Y_i(z = 0)] = E[Y_i(z = 1, d(1)) - Y_i(z = 0, d(0))].$$

The intent-to-treat effect of treatment assignment (z) on receipt of treatment (d) is defined as:

$$\mathrm{ITT}_d = E[d_i(1) - d_i(0)],$$

which equals $E[d_i(1)]$ because $d_i(0) = 0$ for every i under one-way noncompliance.

With this setup, the CACE can be estimated in two ways. First, we can observe who in the control group could have been treated with the placebo design (Nickerson 2005). Observing $d_i(1)$ for all i (i.e., whether treatment could have been delivered for all subjects, including the control group), we can plug in sample estimates in Eq. (4.1). We refer to this estimator as $\widehat{\mathrm{CACE}}_{\mathrm{Placebo}}$. The second approach, more common in existing field experiments with noncompliance, is to divide the intent-to-treat effect by the compliance rate:

$$\widehat{\mathrm{CACE}}_{\mathrm{ITT}} = \frac{\widehat{\mathrm{ITT}}}{\widehat{\mathrm{ITT}}_d},$$

which represents the usual instrumental variables estimator (and does not require that the treatment and placebo, if used, have identical compliance profiles). As with all field experiments with survey outcomes and noncompliance, both estimates are local to compliers who complete surveys (an issue we return to below). One may use the delta method to obtain the following asymptotic variance for $\widehat{\mathrm{CACE}}_{\mathrm{ITT}}$:

$$V(\widehat{\mathrm{CACE}}_{\mathrm{ITT}}) \approx \frac{1}{\mathrm{ITT}_d^2} V(\widehat{\mathrm{ITT}}) + \frac{\mathrm{ITT}^2}{\mathrm{ITT}_d^4} V(\widehat{\mathrm{ITT}}_d) - 2\,\frac{\mathrm{ITT}}{\mathrm{ITT}_d^3} C(\widehat{\mathrm{ITT}}, \widehat{\mathrm{ITT}}_d) \tag{4.2}$$

(Imbens and Rubin 2015, p. 531), where C denotes covariance. Prior work in this literature has examined the asymptotic variance of estimators of $\mathrm{CACE}_{\mathrm{ITT}}$ ignoring the last two terms of Eq. (4.2) (Gerber and Green 2012; Nickerson 2005):

$$V(\widehat{\mathrm{CACE}}_{\mathrm{ITT}}) \approx \frac{1}{\mathrm{ITT}_d^2} V(\widehat{\mathrm{ITT}}). \tag{4.3}$$

Intuitively, this captures the fact that, when comparing the entire treatment and control groups, the presence of noncompliers in both groups clouds our estimate of the difference between treatment and control compliers, increasing the variance of the estimate of the CACE.6

6 For our purposes, ignoring the last two terms in Eq. (4.2) allows for a cleaner comparison between the variance of $\widehat{\mathrm{CACE}}_{\mathrm{ITT}}$ and the variance of $\widehat{\mathrm{CACE}}_{\mathrm{Placebo}}$. As previous authors have noted, these last two terms make little difference in practice. Indeed, the variance of traditional experiments relying on $\widehat{\mathrm{CACE}}_{\mathrm{ITT}}$ is often actually slightly larger than that given in Eq. (4.3), making our comparative statements about the efficiency of $\widehat{\mathrm{CACE}}_{\mathrm{Placebo}}$ more conservative. For example, Green et al. (2003) report six GOTV experiments. In our analysis of all six, ignoring the last two terms results in slightly smaller variance estimates: the mean ratio of Eq. (4.3) over Eq. (4.2) is 0.994 across them. Other researchers have also observed that the additional terms are very small (e.g., Angrist 1990; Bloom et al. 1997; Heckman et al. 1994).

We also make several additional assumptions throughout to simplify exposition of the key ideas. We assume a balanced experimental design with 50% allocation to treatment and 50% allocation to control (which, depending on whether a placebo is used, would be a pure control group that receives no contact or a placebo group that is attempted with an unrelated appeal). We further assume that there is a constant treatment effect, and hence that the true variance of the potential outcomes in the subject pool is the same for treated and control subjects ($V[Y(0)] = V[Y(1)] = \sigma^2$). For simplicity, this variance is assumed to be 1. Given these simplifications:

$$V(\widehat{\mathrm{CACE}}_{\mathrm{ITT}}) \approx \frac{4\sigma^2}{N\alpha^2}, \tag{4.4}$$

where N is the number of subjects randomly assigned and α is the application or contact rate ($\mathrm{ITT}_d$). With this notation and setup in mind, we now proceed to our discussion of how the four practices we study may increase efficiency.

4.3.2 How the Four Practices Can Increase Efficiency

We next formally analyze how each of the four practices we discuss can increase efficiency individually and together. Table 4.1 summarizes our points. It discusses the primary advantages of each of these practices as it has previously been understood, our results about the special benefits each practice can have in field experiments with survey outcomes, and our results about how each practice can complement others in field experiments with survey outcomes to yield additional improvements.

Table 4.1 Potential benefits of and complementarities between four methodological practices.

Placebo (if applicable)
• Generally understood benefits: Identifies compliers in the control group, facilitating estimation of the CACE with much greater precision than the intent-to-treat estimator and meaning fewer individuals must be treated or surveyed to attain the same precision (Nickerson 2005).
• Special benefits in field experiments with survey outcomes: Identifies noncompliers in both groups, allowing noncompliers to be excluded from reinterviews, reducing survey costs.
• Benefits complementing other practices (decreasing costs or increasing benefits of other practices): Increased precision reduces the sample size required for the baseline survey as well.

Baseline survey
• Generally understood benefits: Measures covariates at baseline capable of decreasing sampling variability and allowing theories with predictions for heterogeneous effects to be tested (Bloniarz et al. 2016; Gerber and Green 2012).
• Special benefits in field experiments with survey outcomes: Identifies and establishes a relationship with subjects who can then be reliably reinterviewed, decreasing wasted treatment effort on nonmeasurable subjects; pretreatment outcomes allow sensitive tests for differential attrition.
• Benefits complementing other practices: Identifying subjects who can be reliably reinterviewed reduces the necessary number of placebo interactions, thus decreasing the cost of adopting the placebo design; allows one to determine if the compliers are the same in treatment and placebo on observed characteristics, decreasing the risk associated with the placebo design.

Multiple measures combined into index
• Generally understood benefits: Reduces measurement error (Cronbach 1951), increasing the value of every observation and reducing the sample size required.
• Benefits complementing other practices: Increases the test–retest correlation between the baseline survey and follow-up survey, allowing the baseline survey to decrease sampling variability more strongly.

Online survey mode
• Generally understood benefits: Allows for additional item formats (e.g., the IAT) and may decrease social desirability bias (Gooch and Vavreck 2019).
• Benefits complementing other practices: Can have higher reinterview rates than telephone surveys, strengthening the baseline survey's ability to identify follow-up respondents; multiple measures can typically be included less expensively and with less suspicion, decreasing measurement error.

For our formal analysis of how each of these four practices can increase efficiency, we will consider how an experiment's variable costs $c_{P,B}(\cdot)$ vary with different design choices. $P \in \{0, 1\}$ indicates whether the placebo is used and $B \in \{0, 1\}$ indicates whether a baseline survey is used. Variable cost $c(\cdot)$ is a function of many variables. To reduce notational clutter, we exclude irrelevant variables in each instance and let the context dictate the parametrization. We focus on how variable cost varies as a function of the required sample size N or a desired variance $V^*$, the number of rounds of post-treatment follow-up surveys one wishes to conduct, F (e.g., F = 2 if one wants to test both whether there is an initial treatment effect and then whether any effect lasts in a subsequent round of surveying), and, when considered, the marginal cost of attempting treatment, T, and of conducting a survey, S. Table 4.2 lists the notation and the parameter values from our empirical studies that we will use in our examples.


Table 4.2 Notation and values used in the examples.

Design parameters
• σ² – True variance of potential outcomes. Value used in examples: 1.
• V* – Target variance of a prospective study. Value used in examples: 0.002.
• F – Number of rounds of post-treatment follow-up surveys. Value used in examples: 2.

Treatment parameters
• N – Number of subjects assigned to treatment and control or placebo in total, with N/2 assigned to each condition.
• α – Proportion of subjects attempted for treatment that are successfully treated. Value used in examples: 1/4.
• T – Marginal cost of attempting treatment or placebo contact. Value used in examples: $3.

Survey parameters
• $S_{\text{Mode} \in \{O,T\},\, \text{Measures} \in \{S,M\}}$ – Marginal cost of a completed survey, with either online (O) or telephone (T) mode and single (S) or multiple (M) measures. Values used in examples: $5, except $S_{T,M}$ = $10.
• $R_{\text{Wave} \in \{1,2\},\, \text{Mode} \in \{O,T\}}$ – Response rate to a first (1) or second (2) round of surveys, collected online (O) or by telephone (T). A first round of surveys could refer to a baseline survey before treatment or an endline survey after treatment when there has been no baseline survey. A second round implies only subjects who answered a first round of surveys are solicited. Values used in examples: $R_{1,O}$ = 0.07, $R_{1,T}$ = 0.07, $R_{2,O}$ = 0.75, $R_{2,T}$ = 0.35.
• $\rho^2_{\text{Mode} \in \{O,T\},\, \text{Measures} \in \{S,M\}}$ – R² of a regression of the outcome at follow-up on pretreatment covariates at baseline, with either online or telephone mode and single or multiple measures. Values used in examples: $\rho^2_{O,S}$ = 0.25, $\rho^2_{O,M}$ = 0.81, $\rho^2_{T,S}$ = 0.16, $\rho^2_{T,M}$ = 0.33.

4.3.2.1 Practice 1: Placebo

If failure to treat can occur and be observed (such as in a door-to-door canvass intervention where a subject is not home and a canvasser can record that the subject did not come to the door), a placebo condition can increase efficiency dramatically (Nickerson 2005). As explained above, in an experiment with a placebo condition, subjects in the control group are contacted with an unrelated appeal. The purpose of these placebo contacts is to identify control subjects to whom treatment could be delivered – that is, to identify whether control group subjects are compliers or noncompliers. Subjects in each group who open the door and identify themselves before either the treatment or placebo begins are then used as the basis for comparison when estimating the CACE. The variance of the CACE estimator with the placebo design is:

$$V(\widehat{\mathrm{CACE}}_{\mathrm{Placebo}}) = \frac{4\sigma^2}{N\alpha}, \tag{4.5}$$

where α is the fraction of the N subjects who are contacted, such that Nα is the number of contacted subjects whose outcomes are compared during estimation. As Nickerson (2005) shows, $\widehat{\mathrm{CACE}}_{\mathrm{Placebo}}$ is unbiased under several assumptions: "(1) the [treatment and placebo] have identical compliance profiles; (2) the placebo does not affect the dependent variable; and (3) the same type of person drops out of the experiment for the two groups."

As previously studied, the benefit of the placebo design is that, when contact rates are low, the placebo design can reduce the number of subjects with whom contact must be attempted because it overcomes the problem of diluted treatment effects in the presence of noncompliance (Nickerson 2005). To see this advantage, let T be the marginal cost of attempting to contact a subject to deliver the treatment or placebo (such as the price a paid canvassing firm charges or the opportunity cost of a graduate student's time "per knock"). Considering only the cost of attempting to treat subjects, the cost of implementing the traditional design in a sample of size N with no placebo and no baseline survey is $c_{P=0,B=0}(N, T) = \frac{1}{2}NT$, as only the $\frac{1}{2}N$ subjects in the treatment group are attempted to be contacted. Suppose an experiment is being planned with the aim of achieving an estimate with variance $V^*$. Using Eq. (4.4), delivering treatment in the traditional design thus costs $c_{P=0,B=0}(V^*, T) \approx \frac{1}{2} \times 4\left(\frac{\sigma^2}{V^*}\right)\left(\frac{1}{\alpha^2}\right)T = 2\left(\frac{\sigma^2}{V^*}\right)\left(\frac{1}{\alpha^2}\right)T$. In the placebo design, control group subjects are attempted with the placebo contact. Contact is therefore attempted with all N subjects, such that $c_{P=1,B=0}(N, T) = NT$. Using Eq. (4.5), delivering treatment in the placebo design costs $c_{P=1,B=0}(V^*, T) = 4\left(\frac{\sigma^2}{V^*}\right)\left(\frac{1}{\alpha}\right)T$. When focusing on the cost of treatment delivery alone, the placebo design is therefore cheaper when $4\left(\frac{\sigma^2}{V^*}\right)\left(\frac{1}{\alpha}\right)T < 2\left(\frac{\sigma^2}{V^*}\right)\left(\frac{1}{\alpha^2}\right)T$, which reduces to $\alpha < \frac{1}{2}$ (Nickerson 2005). Intuitively, when the compliance rate is high, the placebo is less beneficial (because adjusting for compliance does not inflate standard errors as much); but when compliance rates are low, designs with placebos can yield very large gains in precision and thus require smaller sample sizes and that fewer individuals be contacted, holding precision fixed.

Less well appreciated is that a placebo can produce even larger efficiency gains in field experiments with survey outcomes because noncompliers need not be surveyed. Without a placebo, all subjects must be surveyed. Incorporating the cost of surveying, $c_{P=0,B=0}(N, F, T, S) = N\left(\frac{1}{2}T + FS\right)$, where F is the number of rounds of post-treatment follow-up surveys and S is the marginal cost of a survey, assuming a 100% survey response rate for now. To achieve an estimate with some desired variance $V^*$, using Eq. (4.4) reveals that the traditional design will cost $c_{P=0,B=0}(V^*, F, T, S) \approx 4\left(\frac{\sigma^2}{V^*}\right)\left(\frac{1}{\alpha^2}\right)\left(\frac{1}{2}T + FS\right)$. Supposing an example contact rate of α = 1/4, $c_{P=0,B=0}(V^*, F, T, S) \approx \left(\frac{\sigma^2}{V^*}\right)(32T + 64FS)$. However, with a placebo, "the group receiving the placebo can serve as the baseline for comparison for the treatment group" (Nickerson 2005). This means subjects who are not successfully contacted in the treatment or placebo groups – all noncompliers – do not need to be surveyed. This reduces survey costs. Incorporating the cost of surveying the αN compliers only, $c_{P=1,B=0}(N, F, T, S) = N(T + \alpha FS)$. Using Eq. (4.5), the placebo design will cost $c_{P=1,B=0}(V^*, F, T, S) = 4\left(\frac{\sigma^2}{V^*}\right)\left(\frac{T}{\alpha} + FS\right)$. Again supposing α = 1/4, this reduces to $c_{P=1,B=0}(V^*, F, T, S) = \left(\frac{\sigma^2}{V^*}\right)(16T + 4FS)$. Note that with α = 1/4, the placebo reduces the costs associated with delivering treatment by half (32T to 16T), but it reduces survey costs 16-fold (64FS to 4FS). With F = 2, T = 3, and S = 5, this is equivalent to an 88% decrease in variable costs.

Illustrating the first way in which the practices we study can complement each other, a placebo also reduces the costs of baseline surveys by reducing the number of subjects who must be recruited to a pretreatment baseline if one is used (again due to the more precise estimates a placebo design produces, which are themselves due to the fact that estimates of the difference between compliers in the treatment and control groups are no longer clouded by the presence of noncompliers in comparisons of the treatment and control groups). To see this, suppose a baseline survey of N subjects is conducted before treatment. Let the marginal cost of each baseline survey also be S. The baseline's variable costs thus are NS. The gross variable cost of incorporating a baseline is an increase in costs of $4\left(\frac{\sigma^2}{V^*}\right)\left(\frac{1}{\alpha^2}\right)S$ under the traditional design and only $4\left(\frac{\sigma^2}{V^*}\right)\left(\frac{1}{\alpha}\right)S$ with a placebo. If α = 1/4, a placebo makes the baseline 75% cheaper to implement.
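A small sketch (our own, with illustrative variable names) of the treatment-plus-survey cost comparison above, under the 100% survey response rate this passage assumes:

```python
# Cost per unit of (sigma^2 / V*), following the formulas above with alpha = 1/4.
F, T, S, alpha = 2, 3.0, 5.0, 0.25

traditional = 4 / alpha ** 2 * (0.5 * T + F * S)   # = 32*T + 64*F*S = 736
with_placebo = 4 * (T / alpha + F * S)             # = 16*T + 4*F*S  = 88

print(1 - with_placebo / traditional)              # ~0.88, the 88% decrease cited above
```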

4.3.2.2 Practice 2: Pretreatment Baseline Survey

A pretreatment baseline survey can increase power in two ways. First, and most obviously, baseline surveys can capture pretreatment covariates that analysts can use to increase precision. This can decrease costs because smaller sample sizes are required to attain a given level of statistical power for estimating the treatment effect. Second, and less obviously, baseline surveys can also decrease treatment costs by identifying subjects who are more likely to be interviewed after treatment. If survey response rates are low, many subjects must be treated to yield each survey response for analysis. By identifying and establishing relationships with subjects who can reliably be resurveyed and only delivering treatment to these subjects, a baseline survey can dramatically reduce wasted effort treating subjects whose outcomes cannot be measured.7

7 Field experiments with survey outcomes only produce estimates of causal effects among individuals who answer surveys. This is true regardless of whether individuals who do not answer surveys are removed prior to random assignment or, in a design with only post-treatment surveys, because outcomes cannot be measured for them. Removing individuals who do not answer surveys from the sample before random assignment is therefore conceptually similar to not being able to measure outcomes for individuals who do not answer post-treatment surveys; in either case, estimates pertain only to those who answer surveys. In practice, it could be possible that those who answer both pre- and post-treatment surveys are systematically different from those who would answer at least a post-treatment survey, but this is an empirical question.

To see these advantages, we will now incorporate survey nonresponse and pretreatment covariates into our analysis and consider the differences between a design with or without a baseline survey. For now, we will assume a placebo is used and outcomes are collected by telephone survey.

First, consider a design using a placebo, a post-treatment telephone survey, and no baseline survey. Let $R_{1,T}$ represent the response rate to the post-treatment telephone survey among the compliers an analyst attempts to survey, where the subscripts indicate subjects are being surveyed for the first time and by telephone. If N subjects are randomly assigned, Nα compliers are contacted, and $N\alpha R_{1,T}$ complier-reporters are surveyed via telephone, then Eq. (4.5) shows the variance of this design will be:

$$V(\widehat{\mathrm{CACE}}_{P=1,B=0}) = \frac{4\sigma^2}{N\alpha R_{1,T}}. \tag{4.6}$$

The cost of this design with a placebo, no baseline, and a telephone survey that collects a single outcome measure is:

$$c_{P=1,B=0}(N, F, T, S) = FN\alpha R_{1,T} S_T + NT, \tag{4.7}$$

where the first term captures the cost of surveying the $N\alpha R_{1,T}$ subjects who complete each round of post-treatment telephone surveys, which carries a marginal cost $S_T$ for each of F rounds of surveying; NT captures the cost of attempting to contact N subjects with marginal cost of treatment T. Using Eqs. (4.6) and (4.7), to achieve some desired variance $V^*$, this telephone-based design would cost:

$$c_{P=1,B=0}(V^*, F, T, S) = 4\left(\frac{\sigma^2}{V^*}\right)\left(FS_T + \frac{T}{\alpha R_{1,T}}\right). \tag{4.8}$$

Note how Eq. (4.8) shows that low response rates to post-treatment telephone surveys $R_{1,T}$ increase the cost associated with treatment.

Now consider the design with a pretreatment online survey and a follow-up online survey. Let ρ² be the R² of a regression of the outcome on pretreatment covariates from the baseline survey and $R_{2,O}$ be the response rate to an online follow-up survey among those who completed a baseline, with subscripts indicating that the follow-up survey is the second time subjects are being surveyed (the first being the baseline) and the online mode (which we will discuss shortly). This design has variance:

$$V(\widehat{\mathrm{CACE}}_{P=1,B=1}) = \frac{4\sigma^2(1 - \rho^2)}{N\alpha R_{2,O}}. \tag{4.9}$$

The cost of such a study would be:

$$c_{P=1,B=1}(N, F, T, S) = FN\alpha R_{2,O} S_O + NT + NS_O, \tag{4.10}$$

where $S_O$ is the marginal cost of an online baseline survey and $S_O$ is also the marginal cost of an online follow-up survey. Using Eqs. (4.9) and (4.10), to achieve some desired variance $V^*$, a design with a baseline survey would cost:

$$c_{P=1,B=1}(V^*, F, T, S) = 4\left(\frac{\sigma^2}{V^*}\right)(1 - \rho^2)\left(FS_O + \frac{T + S_O}{\alpha R_{2,O}}\right). \tag{4.11}$$

Equation (4.11) highlights the potential efficiency gains of a baseline survey in two ways. To see these potential gains, compare Eqs. (4.8) and (4.11). First, costs decrease when baseline survey items are prognostic of the ultimate outcome; (1 − ρ²) shrinks the entire cost because the necessary sample size is lower. Second, whereas telephone survey response rates ($R_{1,T}$) are often lower than 10% in developed countries, we have observed response rates to follow-up surveys among those who have already completed baseline surveys ($R_{2,O}$) of about 75% or more (see the online appendix of Broockman et al. 2017). When $R_{1,T} < R_{2,O}$, this reduces the number of subjects that must be treated, and therefore the cost of treatment, in anticipation that a higher share of treated subjects can be surveyed.

Again illustrating how the practices we study can complement each other, the baseline survey can also dramatically decrease the cost of using a placebo. When a placebo is used but a baseline survey is not, many placebo conversations are wasted on subjects whose outcomes cannot be measured because they will not complete a phone survey. A baseline survey can reduce placebo costs by reducing the number of placebo conversations wasted on nonresponders and, with prognostic pretreatment covariates, increasing the value of every successful placebo conversation.8 The ratio of these costs is $\frac{(1 - \rho^2)R_{1,T}}{R_{2,O}}$. With the parameter values in Table 4.2, a placebo costs about 1.8% of what it would cost to implement with traditional designs.

A baseline survey can also help researchers detect or attempt to adjust for differential survey attrition or improper implementation of a placebo.9 There may also be substantive motivations for using a baseline survey if certain constructs need to be measured prior to treatment. For example, studies of how treatment effects vary by other constructs that are measured in surveys (e.g., whether the treatment effects of a mailpiece are heterogeneous with respect to subjects' level of racial resentment) should measure these constructs prior to treatment to avoid potential post-treatment bias (Gerber and Green 2012; Montgomery et al. 2018).

8 In particular, with a telephone post-treatment survey only, the cost of placebo conversations was $4\left(\frac{\sigma^2}{V^*}\right)\left(\frac{T}{2\alpha R_{1,T}}\right)$. Under the design with a baseline online survey, placebo conversation costs are $4(1 - \rho^2)\left(\frac{\sigma^2}{V^*}\right)\left(\frac{T}{2\alpha R_{2,O}}\right)$ instead.

9 As described in Section 4.2.1.3, differential attrition occurs when the treatment influences who completes a survey. It can bias estimates severely, but is often difficult to detect (see Gerber and Green 2012, ch. 7). However, prognostic baseline covariates allow for differential attrition to be detected more sensitively and, if it does occur, for adjustment models to be applied more persuasively (e.g., Bailey et al. 2016). Likewise, if a placebo is used, the baseline survey also makes the placebo design less risky to implement because it helps one detect whether compliers in each condition differ on baseline outcomes; if implementation of the placebo is found to fail, prognostic baseline covariates may help adjustment models be applied more persuasively. Gerber and Green (2012) describe how to test for covariate balance and differential attrition.
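The following sketch (ours) plugs the Table 4.2 values into Eqs. (4.8) and (4.11) to show the scale of the gains; the roughly $20,000 figure matches the all-four-practices design previewed in Figure 4.1, while the telephone figure is simply what Eq. (4.8) implies under these parameter values.

```python
sigma2, V_star, F, T, alpha = 1.0, 0.002, 2, 3.0, 0.25
S_T, R1_T = 5.0, 0.07               # phone survey, single item, no baseline
S_O, R2_O, rho2 = 5.0, 0.75, 0.81   # online survey, multiple measures, baseline available

phone_no_baseline = 4 * sigma2 / V_star * (F * S_T + T / (alpha * R1_T))                          # Eq. (4.8)
online_with_baseline = 4 * sigma2 / V_star * (1 - rho2) * (F * S_O + (T + S_O) / (alpha * R2_O))  # Eq. (4.11)

print(round(phone_no_baseline), round(online_with_baseline))   # roughly $363,000 vs. roughly $20,000
```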


4.3.2.3 Practice 3: Multiple Measures Analyzed as an Index

Equation (4.11) showed how higher test–retest correlations ρ between baseline and outcome measurements increase efficiency. Due to measurement error, one item may have a small correlation between two survey waves even if the underlying attitude it measures is stable. However, when multiple measures of an attitude are collected and combined into an index, stability between survey waves can increase considerably (Ansolabehere et al. 2008; Cronbach 1951). This increase in stability can increase the precision of estimates dramatically, increasing efficiency.

Empirical values from the application study reported in Broockman et al. (2017) illustrate the magnitude of these potential gains. In that study, analyzing an index of multiple items instead of only one item increases the test–retest correlation ρ to 0.9 from an average of 0.5. This corresponds to a more than threefold increase, to 0.81 from 0.25, for the ρ² used in Eq. (4.11), and thus a more than threefold decrease in costs. Without multiple measures, baseline surveys are less useful for reducing sampling error. However, with multiple measures, baselines can reduce sampling error tremendously. Note that multiple measures can increase precision even when one item is fairly stable; for example, increasing ρ from 0.90 to 0.95 would decrease costs by roughly half.

Although psychology research consistently collects multiple measures to form an index, this practice is surprisingly rare in existing political science field experiments with survey outcomes. We suspect the reason for this has to do with the costs of collecting multiple measures by telephone, a point to which we turn now.
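A brief sketch (ours) of how the (1 − ρ²) factor in Eq. (4.11) translates these reliability gains into cost reductions:

```python
for rho_single, rho_index in [(0.5, 0.9), (0.90, 0.95)]:
    cost_ratio = (1 - rho_single ** 2) / (1 - rho_index ** 2)
    print(rho_single, rho_index, round(cost_ratio, 2))
# 0.5 -> 0.9 cuts costs by a factor of about 3.9 ("more than threefold");
# 0.90 -> 0.95 cuts them roughly in half (about 1.95x).
```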

4.3.2.4 Practice 4: Online Survey Mode

The fourth practice we study is recruiting individuals to online surveys from a defined sampling frame, such as a list of registered voters (as in our empirical studies), a list of all addresses (Jackman and Spahn 2015), Federal Election Commission (FEC) donor lists (Barber et al. 2017; Broockman and Malhotra 2019), a list of physicians (Hersh and Goldenberg 2016), or one of many others (see Cheung 2005). Online surveys can complement the practices studied above in three major ways.

First, online surveys can increase reinterview rates after baseline surveys, increasing $R_2$. That is, we have observed $R_{2,O} > R_{2,T}$, likely because the first survey can capture additional contact information for each respondent (e.g., an email address) and easily provide them with incentives (e.g., a gift card). Increases in $R_2$ increase the value of baseline surveys. In our work so far, reinterview rates in this mode have sometimes exceeded $R_{2,O}$ = 80%. However, reinterview rates on the phone can be considerably lower; we have observed $R_{2,T}$ = 35%.

Second, while collecting multiple measures can be expensive by telephone, surveys that collect multiple measures can be cheaper to administer online (i.e., we have observed $S_{O,M} < S_{T,M}$). For every question in a live telephone survey, an interviewer must read the question and record respondents' answers. We expect telephone surveys rarely collect multiple measures for this reason. Online surveys rarely carry a high per-question cost. The $5 incentives we have provided for surveys of over 50 questions are much smaller than quotes we have received for telephone surveys of this length (Broockman et al. 2017).

Third, online surveys may have higher test–retest reliabilities, such that $\rho^2_{O,M} > \rho^2_{T,M}$, which we observe in the first empirical study reported in Broockman et al. (2017). Such potential increases in $R_2$ and ρ² and decreases in $S_M$ mean collecting outcomes by online panels has the potential to achieve the same precision for less cost than by other survey modes.10

With this said, two major concerns about online surveys bear mentioning. First, when studies are conducted in other settings (e.g., with low Internet penetration rates or high response rates via other modes, such as in-person surveys), many of these parameter values may change, resulting in different optimal designs. Second, respondents to online surveys may prove less representative than those recruited with traditional modes. Representativeness is not always a primary concern in experiments. However, to the extent that it is, we recommend recruiting respondents from an ex ante defined sampling frame. Existing evidence suggests online respondents recruited from a defined frame can be more representative than those who "opt in" to online surveys (e.g., Brüggen et al. 2016). More importantly, being able to compare respondents to a defined frame facilitates empirical examination of how representative a sample is on observables. Researchers should also think critically about how unobservable characteristics of those who respond to any survey mode might affect their conclusions. With this said, in the first empirical study reported in Broockman et al. (2017), we find that online panels in the USA are fairly representative and typically more representative than telephone surveys.11

10 For example, consider the alternative of phone panels. Using Eq. (4.11), the ratio of treatment and baseline survey costs N(T + S) for an online panel design and a telephone panel design would be $\frac{(1 - \rho^2_{T,S})/R_{2,T}}{(1 - \rho^2_{O,M})/R_{2,O}}$. With $\rho^2_{T,S}$ = 0.16 for one item in a telephone survey, $\rho^2_{O,M}$ = 0.81 for multiple measures in an online survey, $R_{2,T}$ = 0.35 for telephone survey reinterview rates, and $R_{2,O}$ = 0.75 for online survey reinterview rates (see the online appendix of Broockman et al. 2017), this ratio of treatment and baseline survey costs between modes is $\frac{(1 - 0.16)/0.35}{(1 - 0.81)/0.75} \approx 9$. For the small costs associated with the follow-up surveys, the ratio is $\frac{1 - \rho^2_{T,S}}{1 - \rho^2_{O,M}} \approx 4$. Using the parameters from Table 4.2, the ratio of the total costs is ≈8.5. Although exact parameters will vary from study to study, this suggests that field experiments that collect outcomes with online survey panels can be nearly an order of magnitude cheaper than field experiments collecting outcomes with other survey modes.

11 Most field experiments with survey outcomes also cannot use panelists in existing Internet-based panels as subjects because survey companies rarely will share the panelists' personal information with researchers, which usually would be needed to separately deliver a treatment in the real world to panelists. The experiments described in this chapter therefore assume that researchers conduct their own surveys using subjects recruited specifically for the experiment (e.g., recruited using information available on voter files). However, in theory, an existing Internet panel could be used were the panel provider to cooperate with the use of panelists in a field experiment. The framework we present below can accommodate this possibility.

Of course, there may also be substantive motivations for using online surveys. Online surveys may allow for additional item formats (e.g., the Implicit Association Test (IAT)). See also Chapter 13 in this volume regarding strategies for collecting behavioral outcomes in surveys, many of which are especially suited to online surveys.
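For completeness, a brief sketch (ours) reproducing the cost ratios reported in footnote 10 from the Table 4.2 parameters; the chapter rounds these to roughly 9 and 8.5.

```python
rho2_TS, rho2_OM = 0.16, 0.81      # one phone item vs. an online multi-item index
R2_T, R2_O = 0.35, 0.75            # reinterview rates by mode
V_star, F, T, alpha, S = 0.002, 2, 3.0, 0.25, 5.0   # sigma^2 = 1 implicit below

treat_and_baseline_ratio = ((1 - rho2_TS) / R2_T) / ((1 - rho2_OM) / R2_O)

phone_total = 4 / V_star * (1 - rho2_TS) * (F * S + (T + S) / (alpha * R2_T))   # Eq. (4.11), phone panel
online_total = 4 / V_star * (1 - rho2_OM) * (F * S + (T + S) / (alpha * R2_O))  # Eq. (4.11), online panel
print(round(treat_and_baseline_ratio, 1), round(phone_total / online_total, 1))  # ~9.5 and ~8.5
```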

4.4 A Framework for Selecting Experimental Designs

Scholars wishing to conduct a field experiment with survey outcomes may encounter substantially different design parameters than those explored in the running examples and Table 4.2. In this section, we provide a framework for how to use the formulas we derived in the previous section to select more efficient designs. We also provide several examples of how scholars can apply this framework, across diverse applications, to their particular questions and setting. These examples will also reinforce our argument that complementarities between these practices can produce large advantages. However, some of our examples also show that, under other parameter values, using some of these practices alone or in combination with others can increase costs (e.g., Example 2).

Table 4.3 organizes our analytical results derived in the previous section. As we will show, these formulas allow researchers to compute the variances and costs of potential experimental designs as a generic function of the parameters in their settings under alternative permutations of the four design practices we have discussed. The notation in Table 4.3 corresponds to the same notation defined in Table 4.2. Subtable 4.3a gives the variances and costs of alternative designs depending on the presence or absence of a placebo, baseline survey, multiple measures, and online survey mode for cases when compliance can be observed and so a placebo is possible.


Table 4.3 Variances and variable costs of alternative designs. For each design, we list the variance V(σ, ρ, N, α, R), the cost as a function of sample size, c(N, ·), and the cost as a function of a target variance, c(V*, ·).

(a) When placebo is possible

• Placebo: Yes; Baseline: Yes.
  $V = \frac{4\sigma^2(1 - \rho^2)}{N\alpha R_2}$;  $c(N, \cdot) = NF\alpha R_2 S + NT + NS$;  $c(V^*, \cdot) = 4\left(\frac{\sigma^2}{V^*}\right)(1 - \rho^2)\left(FS + \frac{T + S}{\alpha R_2}\right)$

• Placebo: Yes; Baseline: No.
  $V = \frac{4\sigma^2}{N\alpha R_1}$;  $c(N, \cdot) = NF\alpha R_1 S + NT$;  $c(V^*, \cdot) = 4\left(\frac{\sigma^2}{V^*}\right)\left(FS + \frac{T}{\alpha R_1}\right)$

• Placebo: No; Baseline: Yes.
  $V = \frac{4\sigma^2(1 - \rho^2)}{N\alpha^2 R_2}$;  $c(N, \cdot) = NFR_2 S + \frac{1}{2}NT + NS$;  $c(V^*, \cdot) = 4\left(\frac{\sigma^2}{V^*}\right)\left(\frac{1}{\alpha^2}\right)(1 - \rho^2)\left(FS + \frac{T}{2R_2} + \frac{S}{R_2}\right)$

• Placebo: No; Baseline: No.
  $V = \frac{4\sigma^2}{N\alpha^2 R_1}$;  $c(N, \cdot) = NFR_1 S + \frac{1}{2}NT$;  $c(V^*, \cdot) = 4\left(\frac{\sigma^2}{V^*}\right)\left(\frac{1}{\alpha^2}\right)\left(FS + \frac{T}{2R_1}\right)$

(b) When placebo is not possible

• Baseline: Yes.
  $V = \frac{4\sigma^2(1 - \rho^2)}{NR_2}$;  $c(N, \cdot) = NFR_2 S + \frac{1}{2}NT + NS$;  $c(V^*, \cdot) = 4\left(\frac{\sigma^2}{V^*}\right)(1 - \rho^2)\left(FS + \frac{T}{2R_2} + \frac{S}{R_2}\right)$

• Baseline: No.
  $V = \frac{4\sigma^2}{NR_1}$;  $c(N, \cdot) = NFR_1 S + \frac{1}{2}NT$;  $c(V^*, \cdot) = 4\left(\frac{\sigma^2}{V^*}\right)\left(FS + \frac{T}{2R_1}\right)$

The presence or absence of placebos and baseline surveys changes these formulas. Survey mode and the presence or absence of multiple measures may change parameters in these formulas, but not the formulas themselves. Subtable 4.3b gives the same but for settings where a placebo is not possible because compliance cannot be observed (such as when the treatment is delivered via the mail).

4.4.1 Example 1: Door-to-Door Canvassing Study in the USA

Figure 4.1 at the beginning of the chapter previewed how a researcher could use our framework to determine the costs of each of 16 ways to conduct a door-to-door canvassing study under a given set of empirical parameters. The results in Figure 4.1 follow from plugging in the parameters from Table 4.2 to the formulas in Subtable 4.3a. In that application, our framework found a design with variable costs that were approximately 98% lower than common designs.
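To make the framework easy to apply, here is a minimal sketch of the c(V*, ·) column of Table 4.3 in Python. This is our own illustrative implementation rather than the authors' published calculator; the function name and interface are hypothetical, σ² defaults to 1 as in the chapter's simplification, and survey mode and the number of measures enter only through the parameter values passed in (S, ρ², and the response rates).

```python
def cost_V(V_star, F, T, S, R1, R2, rho2, alpha=None,
           placebo=False, baseline=False, sigma2=1.0):
    """Variable cost needed to reach target variance V_star, per Table 4.3's c(V*, .) column.

    Pass alpha=None when compliance cannot be observed (Subtable 4.3b); a placebo is
    then unavailable and the `placebo` argument is ignored.
    """
    k = 4 * sigma2 / V_star
    if alpha is None:                      # Subtable 4.3b: placebo not possible
        if baseline:
            return k * (1 - rho2) * (F * S + T / (2 * R2) + S / R2)
        return k * (F * S + T / (2 * R1))
    if placebo and baseline:
        return k * (1 - rho2) * (F * S + (T + S) / (alpha * R2))
    if placebo:
        return k * (F * S + T / (alpha * R1))
    if baseline:
        return k * (1 - rho2) / alpha ** 2 * (F * S + T / (2 * R2) + S / R2)
    return k / alpha ** 2 * (F * S + T / (2 * R1))


# Example 1 (Table 4.2 parameters): the traditional design vs. all four practices.
traditional = cost_V(V_star=0.002, F=2, T=3, S=5, R1=0.07, R2=0.35,
                     rho2=0.16, alpha=0.25)                             # ~ $1,000,000
all_four = cost_V(V_star=0.002, F=2, T=3, S=5, R1=0.07, R2=0.75,
                  rho2=0.81, alpha=0.25, placebo=True, baseline=True)   # ~ $20,000
print(round(traditional), round(all_four), round(1 - all_four / traditional, 2))  # ~0.98
```

Looping this hypothetical function over the 16 permutations of placebo, baseline, survey mode, and single versus multiple measures (each permutation simply swapping in the corresponding S, ρ², R1, and R2 from Table 4.2) should reproduce a cost profile like Figure 4.1.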

4.4.2 Example 2: Mailing Information about Members of Congress

In some settings, a placebo is not possible because compliance cannot be observed. Suppose a researcher wants to examine how individuals learn and retain information about their Members of Congress. A researcher might want to include individuals in many Congressional districts to expand the generalizability of the conclusions. A door-to-door canvass treatment would be difficult to deploy on this nationwide basis, but a mail experiment would be practical. However, one cannot easily observe whether a person opens a piece of physical mail, so a placebo could not be used.12 Subtable 4.3b gives formulas for alternative designs in situations where a placebo is not possible. To select the optimal design, we will use these formulas and again use the values in Table 4.2, but substitute T = $1, corresponding to an example mail treatment with a marginal cost of $1. Figure 4.3 provides the results of applying our framework to this experimental design problem.

12 Unless it is possible to observe compliance (in this case, whether someone opens the mail), there would be no gains from attempting to administer a placebo (i.e., from sending a separate piece of mail to the control group). A placebo is only beneficial insofar as it allows one to identify compliers. A post-treatment survey question asking whether individuals recalled receiving the mail would not be a reliable approach to identifying compliers given that the contents of the treatment or placebo mail may cause subjects to systematically misremember.
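As a usage sketch (again relying on the hypothetical cost_V function sketched above), the mail example amounts to calling the Subtable 4.3b branch with T = $1 and alpha = None:

```python
# Placebo impossible (compliance unobservable); mail treatment at $1 per piece.
traditional_mail = cost_V(V_star=0.002, F=2, T=1, S=5, R1=0.07, R2=0.35,
                          rho2=0.16, alpha=None)                    # ~$34,000 (text: ~$34,285)
all_three = cost_V(V_star=0.002, F=2, T=1, S=5, R1=0.07, R2=0.75,
                   rho2=0.81, alpha=None, baseline=True)            # ~$6,600 (text: ~$6,587)
print(round(traditional_mail), round(all_three))
```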


Figure 4.3 Applying the framework when placebo is not possible: mail example. [Bar chart of variable costs for the eight candidate designs, varying whether a pretreatment baseline survey is used, whether multiple measures are collected, and whether the survey mode is phone or online; the traditional design and the design using all three practices are labeled.]

Under these conditions, employing all three possible practices reduces variable costs from approximately $34,285 if none of these practices are used to approximately $6587. Interestingly, in this application, our framework also shows that using each of two of these practices alone may actually increase variable costs, a conclusion driven by our expectation that multiple measures and panels are especially expensive to administer by telephone.

4.4.3 Example 3: The World Bank Studying a Public Health Intervention in Liberia

Our motivating examples so far have considered how to study the effects of field treatments on political attitudes in the USA, but our framework is much more general. Moreover, it can show how different designs may be optimal for researchers pursuing different aims in different contexts. As an example of how our framework can be extended to a different setting, we consider a recent study by the World Bank examining how Ebola infections affected self-reported outcomes such as employment and schooling in Liberia (Himelein 2015). These outcomes were collected in a telephone panel survey. Suppose these researchers wanted to conduct a field experiment in Liberia to estimate the effect of a public health worker visiting households and providing public health information about avoiding Ebola on these outcomes. Our framework is first able to identify the key parameters of interest that researchers must forecast to determine which designs would be optimal.

In their Liberia study, researchers from the World Bank conducted five rounds of mobile telephone surveys (F = 5). The initial survey response rate ($R_{1,T}$) was 28% and the follow-up telephone survey response rate ($R_{2,T}$) was 73%. Indicative of the share of people who can be reached at home in Liberia when one knocks on their door, the contact rate in the face-to-face Afrobarometer survey conducted in May 2015 in Liberia (Isbell 2016) was 97%, so we assume a treatment application rate α = 0.97. For illustrative purposes, suppose in Liberia an attempted visit from a public health worker is inexpensive given lower wages, such that T = $1, but that online surveys would be much more expensive because many people do not have Internet access and would need to be provided with it ($S_O$ = $25). For the sake of simplicity, we let online and telephone surveys have the same response rates ($R_{1,T} = R_{1,O}$, $R_{2,T} = R_{2,O}$) and let $V^*$, σ², ρ², and $S_T$ remain unchanged from Table 4.2.

Figure 4.4 applies the abovementioned assumed costs, contact rates, response rates, and test–retest correlation to the equations listed in the fourth column of Table 4.3 to calculate the costs under the various designs. In this example, using all four practices that we study would not be the most cost-efficient option, nor would the traditional design in the literature. Instead, the most cost-efficient option would be a telephone survey with a baseline survey and placebo but without multiple measures, to keep the survey short.
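A parameter sketch (ours) for feeding the Liberia assumptions into the hypothetical cost_V function from the framework sketch above; looping over all design permutations should reproduce the ranking shown in Figure 4.4, in which the phone design with a placebo and a baseline survey but a single measure comes out cheapest.

```python
liberia = dict(V_star=0.002, F=5, T=1, alpha=0.97, R1=0.28, R2=0.73)

# Cheapest design according to the chapter: phone mode, placebo, baseline, single item.
phone_placebo_baseline_single = cost_V(S=5, rho2=0.16, placebo=True, baseline=True, **liberia)

# Online designs would swap in S=25 (Internet access must be provided) and the
# corresponding rho^2; multiple-measure phone designs would use S=10 and rho^2=0.33.
print(round(phone_placebo_baseline_single))
```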


Figure 4.4 Example results: variable costs for studying public health intervention in Liberia. [Bar chart of variable costs for the 16 candidate designs, varying whether a pretreatment baseline survey is used, whether a placebo is used, whether multiple measures are collected, and whether the survey mode is phone or online; the traditional design and the design using all four practices are labeled.]

The results in Figure 4.4 could also help these researchers navigate more complicated trade-offs. Suppose a collaborating nongovernmental organization refused to implement a placebo condition. The researchers could now deduce that conducting a baseline survey is not optimal given that there will be no placebo, even though a baseline survey was optimal when the placebo was present. Alternatively, suppose the researchers wanted to collect multiple measures of outcomes to match existing questionnaires. Our framework now suggests that conducting an online survey may be worth the additional cost, as the parameters we input assumed that a phone survey collecting multiple measures offered less cost savings on marginal costs than a short phone survey. These examples illustrate how our framework can demonstrate subtle complementarities and trade-offs between these practices.

Our framework also allows researchers to consider more traditional trade-offs. For example, suppose researchers considered using an online survey without providing Internet access to those who did not have it, thereby limiting the sampling frame to preexisting Internet users but eliminating the cost of providing Internet access. Our framework would allow researchers to compute the money that this choice would save and allow them to consider whether this cost saving was worth the potential external validity limitations this would introduce.

4.5 Concluding Discussion The use of randomized experiments and survey-based research in the social sciences has mushroomed. Together with rising interest in these methodologies, many scholars have begun to conduct field experiments with survey outcomes: experiments where outcomes are measured by surveys, but randomized treatments are delivered by a separate mechanism in the real world. However, challenges familiar to experimental researchers and survey researchers – survey nonresponse, survey measurement error, and treatment noncompliance – mean that common designs for field experiments with survey outcomes are extremely expensive. In this chapter, we reviewed how four practices uncommon in such experiments can yield particularly large gains in efficiency

Field Experiments with Survey Outcomes

and robustness when they are used in combination. We also reviewed a framework that will help researchers select the design that is most optimal in diverse settings where treatment costs, survey costs, survey response rates, and other parameters may change. This framework identifies the key parameters that determine an experiment’s variable costs and allows researchers to examine the cost of a range of possible designs given these parameters. As we discussed, this framework is widely applicable and easily extensible. For example, researchers could use it to internalize the ethical externalities of treating many subjects (by using a larger value for the cost of treatment than the financial cost alone) or to quantify the costs of introducing design practices expected to increase robustness. To accompany this chapter, we are also making code available at http://experiments.berkeley.edu and https:// github.com/dbroockman/repeated-onlinepanel-experiments that provides examples and a calculator for determining sample sizes and costs using this framework. We acknowledge that the costs of conducting field experiments with survey outcomes will still typically be substantial. Researchers should of course carefully consider whether a reasonably well-powered experiment is feasible given their budget constraints. In practice, many of the existing experiments in the literature were conducted in collaboration with partner organizations, who can help cover costs or secure external funding to do so by leveraging their existing funding relationships. See also Chapter 11 in this volume about collaborations with partner organizations. Although we are optimistic about the potential applications of the practices we study, several open questions remain. First, the estimated treatment effects in all experiments with survey outcomes are driven by the individuals who receive the treatment (compliers) and are specific to those who agree to be surveyed (reporters). These complier-reporters might differ in meaningful ways from the rest of the population. Future research should assess how treatment effects may vary across

75

populations, such as by using additional incentives or more intensive treatment delivery efforts to increase the diversity of complier-reporters (Coppock et al. 2017). See also Chapter 21 in this volume for more discussion of generalizing experimental results. Second, treatment effects measured in survey data may overstate individuals’ attitude changes through social desirability bias or demand effects. For example, in a recent field experiment measuring the effect of providing information about the performance of incumbent legislators on voter turnout and incumbent support in Benin, Adida et al. (2019) compare the treatment effect estimates from panel survey data to official election results, finding that survey respondents consistently overreport turning out to vote and voting for the incumbent relative to the official election results. It should be a priority for future research to assess the extent to which treatment effect estimates, heterogeneity, and substantive conclusions from field experiments with survey outcomes can be recovered in experiments with administrative data outcomes, such as through the use of precinct-randomized experiments analyzing actual election returns (Arceneaux 2005). To state the obvious, the ability to conduct field experiments with survey outcomes in no way addresses the potential shortcomings of survey-based outcomes nor reduces the appeal of conducting experiments with behavioral outcomes. Third, baseline surveys may have unintended effects that produce bias or reduce external validity. For example, answering survey questions about a topic might change how people later process information about it, such as by increasing attentiveness (e.g., Bidwell et al. 2015). Most evidence on this phenomenon is either from developing countries or dates from several decades ago, so it is unclear to what extent present-day populations in developed countries would exhibit such effects. This is an important area for future research, with designs readily available in the classic psychometric literature (e.g., Solomon 1949). Answering multiple

Answering multiple follow-up surveys after a treatment may also produce biased estimates of treatment effects’ persistence over time if subjects remember how they answered particular questions in a previous survey wave. Another important priority for future research is to assess the extent to which this artificial persistence occurs and to develop strategies to detect and reduce it.

The particular implementation of each of the practices we studied may also be open to improvement. For example, one possible extension of conducting a baseline survey is to conduct multiple baseline waves prior to treatment. Multiple baselines would further increase stability (increasing ρ²) (McKenzie 2012) and could help identify subjects who are even more likely to participate again (increasing R). Our framework could readily be applied to determine whether the costs of an additional baseline wave prior to treatment would outweigh these benefits. In addition, new incentive structures could increase the rates at which individuals respond to surveys. A final important priority for future work is how best to combine multiple variables into indices in a way that maximizes statistical power while preserving interpretability (e.g., Zhang et al. 2019). Factor analytic methods may put much more weight on some variables than others for reasons unrelated to the substantive motivation behind an experiment and unrelated to the pattern of treatment effects across items.
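
As a minimal illustration of one transparent alternative – not the procedure proposed by Zhang et al. (2019), and using simulated data of our own – the sketch below forms an equal-weighted index of standardized outcome items and reports Cronbach’s (1951) alpha as a simple reliability check.

    import numpy as np

    def equal_weight_index(items):
        """items: (n_subjects, n_items) array. Standardize each item and average
        with equal weights, so no single item dominates the index for reasons
        unrelated to the experiment's substantive motivation."""
        z = (items - items.mean(axis=0)) / items.std(axis=0)
        return z.mean(axis=1)

    def cronbach_alpha(items):
        """Cronbach's (1951) alpha as a reliability check on the item set."""
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1)
        total_variance = items.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1 - item_variances.sum() / total_variance)

    # Simulated example: four survey items driven by one underlying attitude.
    rng = np.random.default_rng(0)
    attitude = rng.normal(size=500)
    items = np.column_stack([attitude + rng.normal(size=500) for _ in range(4)])

    index = equal_weight_index(items)
    print(index[:3])
    print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")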

References

Adida, Claire, Jessica Gottlieb, Eric Kramon, and Gwyneth McClendon. 2019. “Response Bias in Survey Measures of Voter Behavior: Implications for Measurement and Inference.” Journal of Experimental Political Science 6(3): 192–198.
Angrist, Joshua D. 1990. “ERRATA: Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records.” American Economic Review 80(5): 1284–1286.
Ansolabehere, Stephen, Jonathan Rodden, and James M. Snyder. 2008. “The Strength of Issues: Using Multiple Measures to Gauge Preference Stability, Ideological Constraint, and Issue Voting.” American Political Science Review 102: 215–232.
Arceneaux, Kevin. 2005. “Using Cluster Randomized Field Experiments to Study Voting Behavior.” Annals of the American Academy of Political and Social Science 601(1): 169–179.
Bailey, Michael A., Daniel J. Hopkins, and Todd Rogers. 2016. “Unresponsive, Unpersuaded: The Unintended Consequences of Voter Persuasion Efforts.” Political Behavior 38: 713–746.
Barabas, Jason, and Jennifer Jerit. 2010. “Are Survey Experiments Externally Valid?” American Political Science Review 104(2): 226–242.
Barber, Michael J., Brandice Canes-Wrone, and Sharece Thrower. 2017. “Ideologically Sophisticated Donors: Which Candidates Do Individual Contributors Finance?” American Journal of Political Science 61(2): 271–288.
Barber, Michael J., Christopher B. Mann, J. Quin Monson, and Kelly D. Patterson. 2014. “Online Polls and Registration-Based Sampling: A New Method for Pre-Election Polling.” Political Analysis 22(3): 321–335.
Bidwell, Kelly, Katherine Casey, and Rachel Glennerster. 2015. “DEBATES: The Impacts of Voter Knowledge Initiatives in Sierra Leone.” Working Paper, Stanford Graduate School of Business. URL: www.gsb.stanford.edu/gsbcmis/gsb-cmis-download-auth/362906.
Blattman, Christopher, Horacio Larreguy Arbesu, Benjamin Marx, and Otis Reid. 2019. “Eat Widely, Vote Wisely? Lessons from a Campaign Against Vote Buying in Uganda.” Working paper. URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3451428.
Bloniarz, Adam, Hanzhong Liu, Cun-Hui Zhang, Jasjeet S. Sekhon, and Bin Yu. 2016. “Lasso Adjustments of Treatment Effect Estimates in Randomized Experiments.” Proceedings of the National Academy of Sciences of the United States of America 113(27): 7383–7390.
Bloom, Howard S., Larry L. Orr, Stephen H. Bell, George Cave, Fred Doolittle, Winston Lin, and Johannes M. Bos. 1997. “The Benefits and Costs of JTPA Title II-A Programs: Key Findings from the National Job Training Partnership Act Study.” Journal of Human Resources 32(3): 549–576.
Broockman, David E., and Joshua L. Kalla. 2016. “Durably Reducing Transphobia: A Field Experiment on Door-to-Door Canvassing.” Science 352(6282): 220–224.
Broockman, David E., Joshua L. Kalla, and Jasjeet S. Sekhon. 2017. “The Design of Field Experiments with Survey Outcomes: A Framework for Selecting More Efficient, Robust, and Ethical Designs.” Political Analysis 25(4): 435–464.
Broockman, David, and Neil Malhotra. 2019. “What Do Donors Want? Heterogeneity by Party and Policy Domain.” Working paper.
Brüggen, E., J. van den Brakel, and Jon Krosnick. 2016. “Establishing the Accuracy of Online Panels for Survey Research.” Working paper. URL: www.cbs.nl/en-gb/background/2016/15/establishing-the-accuracy-of-online-panels-for-survey-research.
Cheung, Paul. 2005. Designing Household Survey Samples: Practical Guidelines. Number 98 in “Studies in Methods, Series F.” United Nations.
Coppock, Alexander, Alan S. Gerber, Donald P. Green, and Holger L. Kern. 2017. “Combining Double Sampling and Bounds to Address Nonignorable Missing Outcomes in Randomized Experiments.” Political Analysis 25(2): 188–206.
Cronbach, Lee J. 1951. “Coefficient Alpha and the Internal Structure of Tests.” Psychometrika 16(3): 297–334.
Druckman, James N., Donald P. Green, James H. Kuklinski, and Arthur Lupia. 2006. “The Growth and Development of Experimental Research in Political Science.” American Political Science Review 100(4): 627–635.
Druckman, James N., and Thomas J. Leeper. 2012. “Learning More from Political Communication Experiments: Pretreatment and Its Effects.” American Journal of Political Science 56(4): 875–896.
Gaines, Brian J., James H. Kuklinski, and Paul J. Quirk. 2007. “The Logic of the Survey Experiment Reexamined.” Political Analysis 15(1): 1–20.
Gerber, Alan S., Dean Karlan, and Daniel Bergan. 2009. “Does the Media Matter? A Field Experiment Measuring the Effect of Newspapers on Voting Behavior and Political Opinions.” American Economic Journal: Applied Economics 1(2): 35–52.
Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: W. W. Norton.
Gooch, Andrew, and Lynn Vavreck. 2019. “How Face-to-Face Interviews and Cognitive Skill Affect Item Non-Response: A Randomized Experiment Assigning Mode of Interview.” Political Science Research and Methods 7(1): 143–162.
Green, Donald P., Alan S. Gerber, and David W. Nickerson. 2003. “Getting out the Vote in Local Elections: Results from Six Door-to-Door Canvassing Experiments.” Journal of Politics 65(4): 1083–1096.
Heckman, James, Jeffrey Smith, and Christopher Taber. 1994. “Accounting for Dropouts in Evaluations of Social Experiments.” URL: www.nber.org/papers/t0166.pdf.
Hersh, Eitan D., and Brian F. Schaffner. 2013. “Targeted Campaign Appeals and the Value of Ambiguity.” Journal of Politics 75(2): 520–534.
Hersh, Eitan D., and Matthew N. Goldenberg. 2016. “Democratic and Republican Physicians Provide Different Care on Politicized Health Issues.” Proceedings of the National Academy of Sciences of the United States of America 113(42): 11811–11816.
Himelein, Kristen. 2015. “The Socio-Economic Impacts of Ebola in Liberia: Results from a High Frequency Cell Phone Survey, Round 5.” Technical report, World Bank Group. URL: www.worldbank.org/content/dam/Worldbank/document/Poverty%20documents/Socio-Economic%20Impacts%20of%20Ebola%20in%20Liberia,%20April%2015%20(final).pdf.
Imbens, Guido W., and Donald B. Rubin. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge, UK: Cambridge University Press.
Isbell, Thomas A. 2016. “Data Codebook for a Round 6 Afrobarometer Survey in Liberia.” Technical report, Afrobarometer. URL: http://afrobarometer.org/sites/default/files/data/round-6/lib_r6_codebook.pdf.
Iyengar, Shanto, and Lynn Vavreck. 2012. “Online Panels and the Future of Political Communication Research.” In The SAGE Handbook of Political Communication, eds. Holli A. Semetko and Margaret Scammell. Thousand Oaks, CA: SAGE, pp. 225–240.
Jackman, Simon, and Bradley Spahn. 2015. “Silenced and Ignored: How the Turn to Voter Registration Lists Excludes People and Opinions from Political Science and Political Representation.” Working Paper, Stanford University. URL: www.dropbox.com/s/qvqtz99i4bhdore/silenced.pdf?dl=0.
Kalla, Joshua L., and David E. Broockman. 2018. “The Minimal Persuasive Effects of Campaign Contact in General Elections: Evidence from 49 Field Experiments.” American Political Science Review 112(1): 148–166.
Kohut, Andrew, Scott Keeter, Carroll Doherty, Michael Dimock, and Leah Christian. 2012. “Assessing the Representativeness of Public Opinion Surveys.” URL: www.people-press.org/files/legacy-pdf/Assessing%20the%20Representativeness%20of%20Public%20Opinion%20Surveys.pdf.
Krupnikov, Yanna, and Blake Findley. 2018. “Survey Experiments: Managing the Methodological Costs and Benefits.” In The Oxford Handbook of Polling and Survey Methods, eds. Lonna Rae Atkeson and R. Michael Alvarez. Oxford: Oxford University Press, pp. 483–507.
Leeper, Thomas J. 2019. “Where Have the Respondents Gone? Perhaps We Ate Them All.” Public Opinion Quarterly 83(S1): 280–288.
McKenzie, David. 2012. “Beyond Baseline and Follow-Up: The Case for More T in Experiments.” Journal of Development Economics 99(2): 210–221.
Montgomery, Jacob M., Brendan Nyhan, and Michelle Torres. 2018. “How Conditioning on Posttreatment Variables Can Ruin Your Experiment and What to Do about It.” American Journal of Political Science 62(3): 760–775.
Mutz, Diana C. 2011. Population-Based Survey Experiments. Princeton, NJ: Princeton University Press.
Nickerson, David W. 2005. “Scalable Protocols Offer Efficient Design for Field Experiments.” Political Analysis 13(3): 233–252.
Sävje, Fredrik, Michael Higgins, and Jasjeet S. Sekhon. 2016. “Improving Massive Experiments with Threshold Blocking.” Proceedings of the National Academy of Sciences of the United States of America 113(27): 7369–7376.
Sniderman, Paul M., and Douglas B. Grob. 1996. “Innovations in Experimental Design in Attitude Surveys.” Annual Review of Sociology 22: 377–399.
Solomon, Richard L. 1949. “An Extension of the Control Group Design.” Psychological Bulletin 46(2): 137–150.
Zhang, Chelsea, Sean J. Taylor, Curtiss Cobb, and Jasjeet Sekhon. 2019. “Active Matrix Factorization for Surveys.” arXiv preprint arXiv:1902.07634.

CHAPTER 5

How to Tame Lab-in-the-Field Experiments

Catherine Eckel and Natalia Candelo Londono

Abstract In this chapter, we define, categorize, and describe how lab-in-the-field experiments answer questions that concern specific subject populations or particular contexts, concentrating on economics-style experiments in particular. We discuss how to identify questions that require lab-in-the-field methods. We then explain how to develop a lab-in-the-field experiment, highlighting key features of lab and field experiments, and outline the proper usage of pilots in the lab before moving to the field. We next discuss the main dimensions for implementing lab-in-the-field experiments: recruitment, research assistants, literacy of the population, payoffs, community involvement, debriefings, longitudinal surveys, power estimations, etc. We classify studies based on these dimensions. Lastly, we note that the research question should be the most important component driving the choice of subject population.

5.1 The Birth of Lab-in-the-Field and Its Definition

We begin by defining lab-in-the-field experiments, providing a brief history, and listing the types of lab-in-the-field experiments. We argue that lab-in-the-field experiments lie on a continuum between standard lab experiments and field experiments, and that they typically contain elements of both.

We develop our arguments by focusing on economics-style lab-in-the-field studies hereafter.1

1 While this will become clear later, we note here that economics-style experiments typically adhere to two methodological guidelines: they involve incentivized decisions with monetary payoffs and they do not use deception. In addition, these experiments generally avoid introducing specific, concrete context. Decisions are made by individuals, but also can take place in group or market settings. Lab-in-the-field methods also are used in non-incentivized experiments (e.g., Kim 2019).


5.1.1 The Birth of Lab-in-the-Field Experiments

Experimental research occupies a continuous spectrum between the lab and the field. At one extreme, “pure” lab experiments are conducted in highly controlled environments, often with a convenience sample of student subjects. Students are not only easily available, but also are high-quality participants, because they reliably understand and comply with experimental instructions. Subjects know that they are in a study and they make decisions in the controlled environment of the lab. At the other extreme are “natural” experiments, where the participants are making decisions as they go about their daily lives and are unaware that they are being observed. In between resides most experimental research, conducted in a wide variety of environments, with different types of subjects. Lab experiments are often seen as more “internally” valid, in that researchers can be relatively certain about the causal relationship between the treatments and the outcomes, while field experiments are seen as more “externally” valid (more generalizable to other settings), given their more natural decision-making environment. The main advantages of the lab are superior control and ready replicability, while field experiments are more vulnerable to control issues,2 and often are much harder or even impossible to replicate.3

2 With lab experiments, the intention to treat is the same as the treatment of the treated: in the lab, we can verify that all subjects are treated. In the field, this is often not the case: there can be a loss of control over who actually receives the treatment. This loss can make it more difficult to pin down a treatment effect.

3 It is easier to replicate lab experiments than field experiments for three main reasons: (1) lab experimenters have greater control over subjects’ experiences through task design (specific incentive and information structures); (2) because of the fixed lab environment, fewer external factors threaten the consistency of the experimental implementation; and (3) field experiments are frequently costly and difficult to implement and therefore are more costly to replicate than lab experiments. See also Roe and Just (2009) for a discussion.

Still, whether the results from field experiments are more generalizable than lab experiments to other contexts remains an open empirical question (Banerjee and Duflo 2009; Deaton 2010).4 Lab-in-the-field experiments fall between these two extremes, depending on the design. At one extreme, subjects in lab experiments often make decisions in abstract tasks with real payoff consequences that are designed to elicit responses to specific incentive structures and information environments. Formal theoretical models serve as sources of inspiration for many of these abstract tasks. Researchers use these theories not only to determine the set of alternatives the subjects can opt for during the experiment, but also to guide the design of the environment in which subjects will make their decisions. Still, laboratory designs do not exactly mirror the theories they test, so that tests in the lab are always joint tests of the theory along with the researcher’s success in translating the theory into a set of instructions that subjects can consider. Lab researchers typically must relax some theoretical aspects that could overly complicate the task or that are not possible to implement. For example, a competitive market model in economics assumes an infinite number of buyers and sellers, and experimental economists relax this assumption to make a market experiment feasible in the lab (see Smith 1962). Researchers also may manipulate some dimensions as treatments to stress test the predictive power of theories. This includes moving the decision environment from the typically abstract context of the lab (where subjects negotiate over “tokens,” or trade “units,” or contribute to a “fund” that benefits a group) toward a specific context (where subjects negotiate over wages, or trade electricity, or contribute to a playground that benefits the neighborhood).

4 Many have argued that field experiments are inherently superior in this respect, but we note that a field experiment is always conducted in a specific context. It is unknown whether some field experimental manipulations produce the same results when applied to other contexts not considered by previous studies.

That said, not all interesting aspects of context that could affect the predictions of theories can be introduced as treatments in a lab experiment. Lab-in-the-field experiments were born as a way to overcome this obstacle. What motivated researchers to set up labs in the field? In the late 1990s, a group of experimental economists and anthropologists met to plan what would become the “small societies” study, using the simple games of experimental economics to assess cultural differences. These could not be assessed in the standard lab experiment because of the acute shortage of cultural variation among students. The study was inspired by Henrich’s dissertation research; he conducted the ultimatum game5 among the Machiguenga tribe in Peru and found, to everyone’s surprise, that economic man was alive and well in the Peruvian jungle (Henrich 2000). The economists trained the anthropologists on procedure and methodology, and the group worked together to adapt the games to the low-literacy individuals living in remote cultures. The group then took these games into the field and conducted them with the populations they studied. The outcome of this effort was, to our knowledge, the first set of lab-in-the-field studies with “culture” as the treatment.6 Notice that culture as a treatment is clearly not randomly assigned: researchers in this study take advantage of the superior control of lab experiments to argue for the causal effects of culture.7

5 In this game, two individuals – the proposer and the responder – decide how to divide an endowment of money between themselves. The proposer is the first mover and makes an offer to the responder on how to split the endowment. Then, the responder, who is the second mover, decides whether to accept or reject the offer. The endowment is divided as the proposer suggested if the responder accepts the offer. However, if the responder rejects the offer, neither the proposer nor the responder receives any amount of money from the endowment.

6 The project is summarized in Henrich et al. (2001), with each society described in Henrich et al. (2004).

7 Holland (1986) refers to this as the scientific solution to the problem of causal inference. The argument is that the experiments are sufficiently precise scientific “instruments” that the only remaining source of variation in outcomes is due to culture.

The research focused on ultimatum game play and found that play varied across cultures in ways that reflected two important elements: the degree of cooperation required for the culture’s main food production (gathering compared to whale-hunting) and the extent to which the culture was exposed to the market (through wage labor or marketing goods). This illustrated the power of experimental games for measuring and understanding cultural variation and inspired many new lab-in-the-field studies using simple games to describe cultural differences, to elicit differences in preferences across cultural groups, and to pretest potential policy interventions.8 (Also see Chapter 22 in this volume for an overview on comparative experiments.)

5.1.2 What Is a Lab-in-the-Field Experiment?

Both lab and lab-in-the-field experiments use abstract tasks with real financial consequences in highly controlled environments. However, lab-in-the-field experiments move beyond the lab in several important ways. We identify four types of lab-in-the-field experiments, designed to address specific types of research questions. Each of these types is discussed in Section 5.2 below, with examples.9

8 Many of our favorites are collected in Cárdenas and Carpenter (2008). A few earlier cross-cultural studies were published before the “small societies” study, and these could also be classified as lab-in-the-field (e.g., Kachelmeier and Shehata 1997 compared cooperation in North America and China; Cameron 1999 conducted high-stakes ultimatum games in Indonesia).

9 Viceisza (2016) offers a similar typology. His book-length treatment is a very useful reference for those attempting such lab-in-the-field experiments for the first time.

• First, the researcher may wish to test a hypothesis using a specific population or may want to know if a result observed with the standard convenience sample of students generalizes to a broader, or even a representative, population. These experiments might be conducted in a lab and typically use a population other than students, but may even use a specific subsample of students with relevant characteristics.

• Second, lab procedures can be exported to the field for use as measures.
The most common measures are risk aversion, time preference (patience), altruism, cooperation, competition, and in-group discrimination. Incentivized tasks designed for testing theory in lab experiments are adapted for use in the field as outcome or control variables. These may not be experiments in themselves, but may be used as part of a larger experimental (or observational) study.

• Third, lab-in-the-field experiments are used when, instead of implementing a treatment, the experimenter must recruit participants who have, in a sense, already been treated. Subjects bring their “treatment” with them when they come into the lab (the “small societies” project falls into this category). Researchers resort to this approach when the “treatment” is difficult or impossible to manipulate in the lab. Causal inference is justified by the use of consistent, highly controlled experimental measures, with the only remaining variation coming from culture or experience (see Footnote 7). In some cases, it may be possible to recruit a population of interest that arguably has been treated in a way that sufficiently approximates randomization to treatment.

• Fourth, the experimenter may wish to target a particular policy problem by teaching the target population about the games and may use a lab-in-the-field approach to do so. In this case, the game itself may constitute the treatment.

Of course, these four types of studies are not mutually exclusive, and some studies may include mixtures of types. It is also important to clarify differences in terminology as compared to previous literature. The widely cited typology introduced by Harrison and List (2004) provided canonical definitions on the spectrum of experimental designs. They define an “artefactual field experiment” as a lab experiment with nonstudent populations, which is close to our definition of lab-in-the-field experiments. We deviate from this definition in two ways: one related to terminology and the other to inclusion criteria.

First, we describe the experimental lab tasks as real and abstract instead of “artefactual,” which may be misunderstood. Second, following our definition, we classify some lab experiments with students as lab-in-the-field experiments when some attributes of the students/subjects are indispensable for answering the research question – that is, they constitute a specific, targeted population. Our definition also extends to survey experiments, social media experiments, and psychology studies that require distinct samples of students as well as nonstudents. Our definition of lab-in-the-field experiments is closely related to the definition given by Gneezy and Imas (2017), who also use the same term. The main difference is that these authors exclude lab experiments with representative nonstudent samples from the classification of lab-in-the-field experiments. We consider these as part of the family of lab-in-the-field experiments because they require specific samples, not convenience samples. (See Chapter 9 in this volume for an overview on convenience samples.) Charness et al. (2013) use the term “extra-laboratory” experiments and define it as we do lab-in-the-field experiments. They also argue that lab-in-the-field experiments are distinct from field experiments and include lab experiments with samples of subjects that are specific to the research question. Their definition is closest to our own.

5.2 Types of Lab-in-the-Field Experiments

5.2.1 Specific Populations

An important reason for moving beyond the standard lab experiment is the need to recruit subjects with specific characteristics. It is sometimes argued that students are different from the population and so their results are not generalizable to other populations. A number of experimentalists have extended widely used lab experimental protocols for use with nonstudent populations. One important example of this is the use of representative populations.

For example, Bellemare and Kröger (2007) test the external validity of lab experiments by measuring social capital using a trust game10 with students and comparing the lab outcomes with the results of the same game from a representative sample of the Dutch population. In this study, the attribute of representativeness is required to address the question, and in no way would the study have been feasible with a manipulation in the lab. Another approach is to attach an incentivized elicitation or game to a nationally representative survey, as in Fehr et al. (2003), where the researchers embed a trust game into a survey. Fréchette (2016) details some of the advantages (primarily generalizability) and disadvantages (cost being foremost) of representative-sample studies and gives a number of other examples. Other studies use specific populations because the theory in question is thought to be more applicable to the types of environments and decisions they operate in. For example, in auction experiments, several persistent anomalies have been found among students (e.g., the well-known “endowment effect,” whereby subjects value an object more when they own it than when they do not). Researchers have recruited specific populations as diverse as wool traders, sports card traders, oil executives, CEOs, and politicians to explore the generality of the lab findings. For the interested reader, many of these studies are summarized and discussed in Fréchette (2016). Lab-in-the-field experiments also serve well to address research questions that require religious participants (see Nielsen 2016 for a comprehensive discussion). For example, Condra et al. (2017) explore the impact of a religious authority versus the impact of scriptures on contributions to public goods in an Islamic religious setting in Afghanistan. For this, they need a sample of subjects for whom religious authority matters.

10 The Online Appendices D and E contain the instructions and forms for this task. The interested reader can find the online appendices using this URL: https://osf.io/6s359/?view_only=64657aacc0f84811971947f2945bebf6.

The study elicits voluntary contributions to a public good (a hospital) with several randomly assigned treatments varying the authority (whether a cleric delivering a message is dressed as such or informally) and whether religious verses are read out. The authors find that authority has less impact on contributions to public goods than the religious verses. In other studies, the population is selected because of their particular experience with political institutions. Grossman and Baldassarri (2012) illustrate the value of lab-in-the-field experiments for understanding the mechanisms underlying the success of sanctioning institutions for fostering cooperation. They combine observational data with lab-in-the-field experiments designed to reflect key aspects of the field setting in Uganda to show that elected officials are more effective at getting their groups to cooperate than appointed leaders.

5.2.2 Measurement

When the lab-in-the-field is used for measurement, the designs are guided by the social scientist’s view of individuals as maximizers of utility functions. The measures attempt to capture something like the parameters of those functions.11 A researcher may want to know the risk preferences of villagers facing a choice of agricultural technology, or the time preferences of small business owners, or the altruism or distributional preferences of potential voters, requiring measurement in the field. A particular theory about the relationship between such preferences and the behavior under study (adoption of technology, opening a small business, deciding to vote or run for office) would guide the recruited subject populations and the implementation of the measure. For concreteness, consider an abstract task designed by Eckel and Grossman (2008) to measure risk preferences.

11 The measurement of preferences becomes important when behavior is viewed through a “preferences lens” – that is, behavior is seen as the outcome of preferences interacting with constraints. This view is often adopted by economists and political scientists. Other social/behavioral sciences use other lenses: psychologists might view behavior through a “personality lens” and sociologists through a “social roles” lens. All can be informative.


Figure 5.1 Risk preference elicitation.

In the lab, we show subjects a set of lotteries, described as 50/50 gambles, in table format. To adapt the task for the field, we developed a form that makes the gambles very clear and salient, shown in Figure 5.1. We use simple, intuitive instructions and concrete randomization devices such as dice or chips. Subjects in this task must select one circle representing a gamble or investment from Figure 5.1. Each circle is divided in half and contains a low and a high amount of money, each of which has a 50% probability of being won. Each circle is a concrete representation of a lottery, with payoffs indicated by images of currency; the arrow at the top represents a loss of $10. We teach subjects the task using a script that walks them through each possible choice and its payoff consequences. The Online Appendix A contains the instructions for this task. The interested reader can find the online appendices using this URL: https://osf.io/6s359/?view_only=64657aacc0f84811971947f2945bebf6. The task is structured so as to reveal the decision-maker’s risk aversion.

Notice that the expected payoff increases clockwise, starting from the circle with payoffs “$40/$40” and moving toward the circle with payoffs “−$10/$130,” but so does the variance (risk). Subjects therefore face a trade-off between risk and expected payoff. Once a subject reveals a preference for one of the circles, researchers are able to infer that individual’s risk tolerance. Once a subject selects a circle, he or she then draws a chip from a bag without looking. The bag contains two chips representing the high and low amounts of money. As adapted, the task is simple enough for anyone to understand and takes about five minutes to implement with moderately literate subjects. It has been used in lab-in-the-field experiments with samples whose desired sociodemographic attributes align with the researcher’s study objectives (e.g., Cárdenas and Carpenter 2013; Moya 2018). By using this risk task, researchers classify subjects according to their level of risk tolerance. Risk-averse individuals prefer less exposure to variance (risk), even though this preference implies lower expected values.

As a result, these individuals select any of the following circles: $40/$40, $30/$60, $20/$80, or $10/$100. The expected values and variances of the payoffs in these four circles are lower than those of the other two circles ($0/$120 or −$10/$130). For example, a researcher would say that a subject who selects the circle with payoffs “$40/$40” is more risk averse than a person who selects one of the other five circles. A subject who selects the circle with payoffs “$40/$40” prefers the certainty of receiving $40 instead of facing a 50/50 chance of an extra gain/loss. In contrast, risk-neutral individuals will select the highest expected value available. Notice that the two circles $0/$120 and −$10/$130 have the same expected value – the highest among all six circles – but different variances. Thus, risk-neutral subjects could select either of these two circles. Risk-loving subjects, in contrast, have a preference not only for the highest expected value, but also for the highest variance. As the circle $0/$120 has a lower variance than the circle −$10/$130, a risk-loving subject prefers the circle −$10/$130 to the circle $0/$120.12

12 If one is willing to assume a particular functional form of a utility function, then parameter ranges for the function are implied by the choices made in this task. The interested reader can find a discussion in Dave et al. (2010).

Many studies collect preference measures as outcome measures or as covariates for use in the data analysis of experimental and survey studies. These are discussed in Gneezy and Imas (2017), who survey the literature in this area. We have included in the online appendices a set of instructions and protocols for measuring preferences in the field. These have all been developed for easy implementation with low-literacy populations for whom participation in research studies is an unusual experience.
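
The classification just described can be checked directly. The short script below – our illustration, using the dollar amounts shown in Figure 5.1, not code from the original study – computes each circle’s expected value and standard deviation and shows the ordinal risk-tolerance ranking implied by a subject’s choice.

    # The six 50/50 gambles from the Eckel-Grossman task as described above
    # (low payoff, high payoff), ordered from the safe option to the riskiest.
    gambles = [(40, 40), (30, 60), (20, 80), (10, 100), (0, 120), (-10, 130)]

    def summarize(low, high):
        ev = (low + high) / 2   # expected value of a 50/50 gamble
        sd = (high - low) / 2   # standard deviation of a 50/50 gamble
        return ev, sd

    for rank, (low, high) in enumerate(gambles, start=1):
        ev, sd = summarize(low, high)
        print(f"choice {rank}: ${low}/${high}  EV = {ev:.0f}  SD = {sd:.0f}")

    # Expected value rises with each step until the last two gambles, which
    # share EV = 60, while the payoff spread keeps growing; a subject's chosen
    # index therefore gives an ordinal measure of risk tolerance
    # (1 = most risk averse, 6 = risk loving).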

5.2.3 Recruiting the Treated

Often a research question requires subjects with specific characteristics, but several factors may make it difficult or impossible to randomly allocate or manipulate these characteristics in the lab. For example, a researcher may wish to understand how exposure to violence impacts certain behaviors, but it would take a very clever design to manipulate violence exposure without crossing ethical lines.13 However, lab-in-the-field experiments could solve this issue by recruiting a population that already has been exposed to violence. In this case, care must be taken to avoid potential selection effects if, for example, experiencing violence is associated with a choice by a subject that could affect exposure. To see how, consider these lab-in-the-field studies. Moya (2018) explores how exposure to violence impacts risk preferences. Individuals exposed to violence suffer economic losses from the destruction of capital and resources, but Moya (2018) explores whether these victims also experience economic losses through psychological damage. Specifically, the main hypothesis is that extreme exposure to violence (i.e., massacres) leads to trauma and psychological disorders (e.g., phobic anxiety), which generate risk aversion and lower levels of investment. This lab-in-the-field study uses the risk task described above and requires a population with three attributes that cannot be reproduced in the lab: (1) persons displaced by violence, (2) who experienced moderate to severe levels of violence, and (3) who were exposed to violence en masse or on separate occasions.

13 It may not be impossible to study this question with a direct manipulation, but it is likely to be risky. For example, a researcher could consider an option that is similar to a peace process: taking an audience that would ordinarily be exposed to extreme violence (i.e., many teenage boys) and encouraging them to experience less violence. However, even here, participants who respond to the encouragement by choosing to experience less violence could be in danger as a result of the study. For example, militants who reject participation in the peace processes sometimes retaliate against previous comrades who decide to join (Acosta 2019). A recent study encouraged individuals in Hong Kong to be more politically active by incentivizing participation in peaceful demonstrations (Bursztyn et al. 2019). Those demonstrations then became considerably less peaceful than when the study began. Despite securing institutional review board approval, the authors are getting quite a bit of pushback due to ethical concerns about the safety of those who responded to the incentives by demonstrating more.

Thus, to obtain a sample of 284 victims of violence, Moya engaged in a lengthy recruitment process of internally displaced persons in Colombia, who account for 15% of the country’s population. In this study, preference measurement helps to shed light on the mechanism by which trauma affects subsequent earnings by comparing the preferences of individuals exposed to different levels of violence (i.e., moderate, severe, en masse, separate). Another example is Mironova and Whitt (2014), who investigate how proximity to ethnic groups who were rivals in previous conflicts impacts in-group bias. Their question is how exposure to previous enemies affects pro-social behavior toward members of those groups. The study requires a population in a conflict zone with varying exposure to former rivals. This exposure cannot take place in the lab. As a result, Mironova and Whitt select a relevant field population who have the necessary attributes: Albanians and Serbs in postwar Kosovo. Moreover, the authors guarantee the requisite levels of variation in the proximity of both groups by selecting Kosovo Serbs who live at different distances from Albanians. Kosovo Serbs make decisions in several dictator games that vary the ethnic identity of the recipients (e.g. Albanians and Serbs). The dictator game is a simple abstract decision task that measures altruistic behavior. In this task, a subject receives an endowment and then must decide how much, if any, to give away to another person, the recipient. The more money a subject sends to a recipient out of a given endowment, the more they care about the consumption of the recipient. The difference between the amount sent to an in-group and an out-group member is an incentivized measure of ingroup bias. (The Online Appendices B and C contain the instructions for an experimental design that uses four dictator games and was implemented in a low-income neighborhood in the USA; Candelo et al. 2019.)14 14 In a similar study, Gilligan et al. (2014) explore the impact of wartime violence on social cohesion in Nepal.
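
To make the incentivized bias measure described above concrete, the following sketch computes it from hypothetical dictator-game allocations; the data, endowment size, and variable names are our own illustrative assumptions, not values from the studies cited here.

    # Hypothetical dictator-game allocations out of a 10-unit endowment:
    # each subject decides how much to send to an in-group and an out-group recipient.
    allocations = [
        {"subject": 1, "in_group": 5, "out_group": 2},
        {"subject": 2, "in_group": 4, "out_group": 4},
        {"subject": 3, "in_group": 6, "out_group": 1},
    ]

    # In-group bias = amount sent to the in-group recipient minus the amount
    # sent to the out-group recipient; positive values indicate favoritism.
    for a in allocations:
        print(f"subject {a['subject']}: in-group bias = {a['in_group'] - a['out_group']}")

    mean_bias = sum(a["in_group"] - a["out_group"] for a in allocations) / len(allocations)
    print(f"sample mean bias = {mean_bias:.2f}")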

As another example, Candelo et al. (2017) study social exclusion in the lab by recruiting subjects with different experiences of social exclusion. Rather than manipulate social exclusion in the lab, which might have a negative impact on subjects, they explore the impact of identity and social exclusion on contributions to public goods games with a lab-in-the-field design. The attributes of interest are social exclusion and identity. Thus, the authors select a population of immigrants in the USA living in low-income neighborhoods who experience different levels and dimensions of social exclusion and US identity (i.e., Hispanics in Texas). The outcome measure that the authors export from lab experiments is a public goods game that measures cooperative behavior in a group. In this task, cooperation with the group is measured by contributions to a public good. The (continuous) treatment is social exclusion, experienced and perceived, as assessed by survey measures. Thus, the subjects are “treated” by their life experiences, and then the extent of the treatment is assessed by the experimenters using experimental measures. The instructions used are in the online appendices of their manuscript. The authors find that higher levels of experienced and perceived social exclusion are associated with lower contributions to public goods. A number of studies have recruited student subjects with specific experiences or characteristics or exposures to a previous “treatment.” For example, Barr and Serra (2010) recruit international students from different societies around the world who have grown up exposed to different levels of corruption to study the persistence of cultural influences on corrupt behavior. Here, the required attribute from the selected populations is different levels of tolerance towards corruption. Relatedly, some studies recruit student subjects in different countries for cross-cultural comparisons. Banuri and Eckel (2015) explore the impact of punishment (i.e., crackdown) on bribery behavior using students in the USA and Pakistan. The authors implement

corruption games with two student populations exposed to different levels of bribery and corruption norms. (See Banuri and Eckel 2012 for a review of experiments on corruption.) Another cross-cultural study analyzes the impact of social distance on prosocial behaviors (i.e., altruism, trust, and reciprocity) given different social norms in China, Japan, Korea, and the USA (Buchan et al. 2006). They compare collectivism (i.e., the group needs have priority with respect to individual desires) with individualism (i.e., individual desires are more important than group needs) using samples of student subjects recruited in the four countries. The advantages of these lab-in-the-field experiments with students in different countries are several. First, the total cost of bringing students to the lab is lower than that of bringing other populations to the lab (i.e., students are more accessible to university professors and have lower opportunity costs of participating in lab experiments). Second, students are quite similar in several sociodemographic characteristics, and this facilitates the interpretation of treatment effects. Third, it also illustrates that the demographic characteristics of the sample per se are not critical, but its context is (i.e., cultural background and country of origin vary across student samples). Are these studies experiments? They combine “homegrown” experiences with incentivized elicitations of preferences or behavior in experimental games. The “treatments” are the experiences of the subjects that they bring with them into the lab. Methods developed in lab experiments are then used to elicit the outcome measures: risk preferences (i.e., the Eckel and Grossman risk task described above) and pro-social behavior (i.e., the dictator game). The only way to observe the treatment effects is to go out into the field and obtain subjects who have received the treatment; the typical convenience sample of students cannot be used to answer the research question. Of course, a researcher has the option of making

either of the following two strong cases for the experiment to be valid: (1) the control in the experimental games allows for causal inference, and (2) the exposure is random and not the result of self-selection (see Chapter 9 in this volume for an overview on convenience samples).

5.2.4 Teaching

Beginning with the pathbreaking work of Ostrom (1990), political scientists and economists developed an appreciation for locally evolved solutions to common pool and public goods problems. Some of Ostrom’s students and followers further extended her paradigm by using experiments to teach local populations about ways to manage their resource problems. Notable among these is Juan Camilo Cárdenas, whose lab-in-the-field experiments helped local populations to understand and solve their own social dilemma and common pool resource problems. His approach has been to implement lab-in-the-field experiments for research, but then to use the common experience of the participants in the field to discuss their own resource and commons problems. Bernal et al. (2016) provide an especially interesting example in which experimental games, repeated over time, are used not only to test behavior, but also to educate participants about the management of water resources. In this study, 200 individuals who live in an Andean village with water irrigation problems make decisions in the same collective action game twice. After making decisions in the first game, participants attend a workshop in which researchers discuss both the results of the first game and how these relate to their water irrigation problem. A few months later, participants make decisions for a second time in the same collective action game. The authors find that attending the workshop increases individual cooperation in the second game. The games act as a starting point for a discussion about policy that can lead to real change.
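
For readers unfamiliar with the structure of these collective action tasks, the sketch below implements one standard linear public goods specification – an illustration with assumed parameter values, not necessarily the exact game used in the studies cited in this chapter – in which each token kept benefits only its owner, while each token contributed benefits every member of the group.

    def public_goods_payoffs(contributions, endowment=20, mpcr=0.5):
        """Linear public goods game: each token kept is worth 1 to its owner;
        each token contributed returns `mpcr` to every group member.
        Free riding is individually optimal when mpcr < 1, but full contribution
        maximizes the group total when mpcr * group size > 1."""
        group_return = mpcr * sum(contributions)
        return [endowment - c + group_return for c in contributions]

    # Illustrative four-person group with a 20-token endowment and an MPCR of 0.5.
    print(public_goods_payoffs([20, 10, 5, 0]))

In this example the free rider earns the most in the group, yet everyone would earn more if all contributed fully – the tension that both the research and the teaching uses of these games exploit.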


5.3 How to Get Started with a Lab-in-the-Field Experiment

In this section, we discuss how researchers interested in lab-in-the-field experiments should consider combining these with lab and with field experiments.

5.3.1 What Students in the Lab Can Teach Us

In this subsection, we argue that traditional lab experiments are indispensable preparation for implementing lab-in-the-field experiments. The most important reason is that lab-in-the-field researchers usually rely on lab designs, including a limited set of canonical games (e.g., ultimatum, dictator, trust, and public goods games for social preferences, and structured decisions to elicit risk and time preferences) that have a long history of usage in the laboratory (Eckel 2014). As examples, the Online Appendices include instructions and forms for dictator, trust, and public goods games (i.e., Appendices B, C, D, and E) and for risk and time elicitations (i.e., Appendices A, F, and G). Previous lab studies have developed and refined procedures for these games, so that the results can be relied upon. Hence, implementing these lab designs in the field is a safe option, as previous studies show their internal validity (i.e., the lab task outcome measures what it is supposed to measure and presents variability), external validity (i.e., the lab task correlates with predicted behaviors in other settings), and stability (i.e., the lab task outcome for an individual, apart from those who have experienced a relevant life change, does not change in a dramatic way at different moments in time). Thus, before going to the field, researchers can modify the lab design to address the specific research question and needs of the target population and run pilots to verify whether the modified design is successful.

As the costs are lower in the lab than in the field (including implementation costs as well as subject payments), we suggest running this type of pilot in the lab. Moreover, it is important to notice that piloting is even more important when a researcher is using an experiment that does not have a canonical history. In contrast, piloting is not necessary if the researcher is using a canonical game without modifications (i.e., the canonical dictator game). For example, Candelo et al. (2018a) evaluate the impact of social distance on giving using a “comparative dictator game,” a modified version of the dictator game, in a lab-in-the-field experiment. Previous studies have shown that a feature of the dictator game is that a significant fraction of subjects keep half of the endowment and send the other half to a recipient. There are several potential explanations for a 50/50 split (e.g., a social norm of equal sharing; a cognitively easier decision for subjects), indicating that the dictator game could be capturing something outside its intended purpose of assessing individual altruistic preferences. Then, the dictator game would not be very useful for exploring conditional giving (according to the worthiness of the recipient), as a significant number of outcomes would be a 50/50 split: the norm would bias the results. The modified dictator game addresses this concern: participants make four separate dictator decisions with four different recipients (the instructions and forms for this game are in the Online Appendices B and C). The recipients vary in terms of social distance, from relative to friend to neighbor to anonymous counterpart, and are presented simultaneously to the subject (see Candelo et al. 2019; Eckel et al. 2018 for other applications of the comparative dictator game). The advantage of this design is that subjects can easily compare the provided characteristics of the recipients and act based on these differences.

Using pilot data from the lab, the authors show that this modified design alters the typical results: instead of dividing the money evenly, subjects compare the social distance of recipients,15 and there is a significant reduction in 50/50 splits for the baseline game (a division with an anonymous stranger counterpart). That said, this pilot confirms that the new design works for exploring conditional giving.

15 Subjects are asked to bring the names and addresses of a friend and a family member to the session before knowing the purpose of the study. In terms of social distance, we assume that family members and friends are closer than strangers. Then, subjects compare a family member, a friend, and a stranger. Family members are selected as such by the subjects and could be those who live in the same household or are related but not coresident.

Candelo et al. (2018a) then implemented the design with a representative sample from 11 low-income Mexican villages in a field setting with high stakes (two days’ wages). Based on prior research, particularly into social networks, the authors generate theoretical predictions for how sharing resources in dictator games will be responsive to recipients who are close and distant social contacts. Their results stress that inter-household and intra-household transfers are different mechanisms to attenuate poverty: family contacts share more resources than neighbors and strangers. This main result sheds new light on how policies aimed at reducing poverty should incorporate the vital role social networks play in determining inter-household and intra-household transfers.

This study illustrates how running pilots in the lab can be used to develop lab-in-the-field designs and protocols. As in the previous example, we suggest that the results of pilots using students should be published along with the ultimate paper. Researchers often pretest several modifications of lab designs before implementing them in the field, but these trials never reach the light of publication. To assist future researchers, we suggest that authors provide more details about their design choices, including explorations with pilots, perhaps in an appendix. Documenting these processes leads to more rapid scientific progress because it shows other researchers which modifications are successful and which ones are not.
It also allows researchers to observe how the convenience sample of students reacts in the modified designs, which allows researchers to rule out the possibility that the field-relevant design modification “produced” any observed differences in behavior between students and the target population. Moreover, pilots should be documented to certify and register that the final manuscripts are not the results of fishing expeditions with nonstudent populations. It is important to clarify that this process is different from registering the study, but it can be considered a complement to registration, as well as a way of enhancing transparency (see Chapter 18 in this volume for an overview of preregistration).

5.3.2 What the Untamed Field Can Provide

In this subsection, we consider how lab-in-the-field experiments can go hand in hand with field experiments in order to explore mechanisms that the latter are not able to disentangle. Subjects in field experiments make decisions that seem more natural to them because the researcher does not have the power to restrict the set of potential outcomes or the context in which decisions are made. This level of naturalism implies a clear trade-off: it is hard for researchers to establish their subjects’ level of compliance with the treatments. Subjects may not even notice the treatments. This means that not all subjects who were supposed to be treated were treated, resulting in an endogeneity of the treatment. As a result, field researchers, who are completely aware of this problem, carefully interpret the treatments in a broad sense to include cases of noncompliance (i.e., this description includes the case when researchers are aiming to estimate the intent-to-treat effect).16

16 Angrist et al. (1996) note the assumptions under which it is possible to estimate the average treatment effect among compliers (defined as those who would take the treatment if and only if they are assigned to the treatment group). Athey and Imbens (2017) discuss the statistical problems that arise when noncompliers are included in the data analysis as treated or are completely dropped from the analysis (if, indeed, they can be identified).
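
To illustrate the complier-focused quantity that footnote 16 refers to, the sketch below computes the simple Wald (instrumental variables) estimate of the average treatment effect among compliers from simulated assignment, take-up, and outcome data; the data-generating values are arbitrary assumptions of ours, chosen only for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5000
    z = rng.integers(0, 2, n)           # random assignment to treatment
    complier = rng.random(n) < 0.6      # 60% would take the treatment if assigned
    d = z * complier                    # actual take-up (one-sided noncompliance)
    y = 0.5 * d + rng.normal(size=n)    # outcome; true effect of 0.5 among the treated

    itt = y[z == 1].mean() - y[z == 0].mean()        # intent-to-treat effect
    take_up = d[z == 1].mean() - d[z == 0].mean()    # first stage (share of compliers)
    late = itt / take_up                             # Wald estimate of the complier ATE

    print(f"ITT = {itt:.2f}, take-up difference = {take_up:.2f}, LATE = {late:.2f}")

Lab-in-the-field elicitations can then be interacted with take-up or with the estimated effects to probe the heterogeneity discussed in the main text.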


Moreover, the level of compliance of the treated subjects sometimes is not clear in field experiments because researchers do not know exactly how subjects perceive the treatment. This is what Viceisza (2016) defined as “the black box” in field experiments. Thus, researchers cautiously interpret the effects of the treatment given the conceivable degrees of noncompliance. For these scenarios, lab-in-the-field experiments can help to “unpack the black box” of field experiments (see Viceisza 2016 for a theoretical model). Researchers use incentivized measurements from lab-in-the-field studies to identify types of individuals for whom there may be heterogeneous treatment effects. For example, Liu (2013) and Ward and Singh (2015) explore how individuals’ risk preferences and loss aversion, measured through lab-in-the-field experiments, relate to the adoption of new agricultural technologies offered through field experiments in China and India, respectively. Another way to address this problem is by implementing abstract lab designs that could be extrapolated to field designs. Researchers could implement lab designs with a subpopulation of interest from the field study and could test the correspondence of these lab outcomes to the field outcomes (see Coppock and Green 2015 for an overview of the literature on correspondence). As different degrees of noncompliance behavior also generate heterogeneous effects, Athey and Imbens (2017) recommend gathering pretreatment variables related to the heterogeneity to explore variation in local average treatment effects. Lab-in-the-field experiments have a role to play in this pre-stage as well, through running lab designs in the field to elicit relevant behaviors for different subpopulations that are participating in field experiments and using these elicitations to help identify key variables related to heterogeneity in treatment effects in the field. As attempts are made to scale up field experiments, these heterogeneous effects could be used to fine-tune policy recommendations. Another reason for implementing lab-in-the-field studies together with field experiments is that there is evidence that scaling up field experiments does not always achieve the expected results.

For example, the field experimental results of Garner et al. (1995) promoted laws supporting the arrest of perpetrators of domestic violence after an emergency call by showing that arresting the perpetrator, instead of mediating or separating couples, reduces future domestic abuse by 50%. In contrast, further evidence shows that the implementation of this law has increased the number of domestic homicides (Iyengar 2010). Once this type of law is implemented, more altruistic and risk-averse subjects might not call the police because they do not want to hurt their partners with an arrest or are afraid of doing so (see Paluck and Shafir 2017 for a meticulous discussion). Thus, it is interesting to elicit preferences that might impact the success of a treatment and so might affect success in scaling up the study. For example, scaling up may consist of selecting one among several possible legally sanctioned interventions, which then changes the frame in which the implementation takes place. Subjects develop expectations regarding the impact of the intervention not only on their own outcomes, but also the outcomes of others. Depending on their preferences, this might change the way subjects respond to the treatment. Gerber and Green (2017) consider that field experiments in voter turnout are usually nonpartisan and that the effects of canvassing in partisan environments could be quite different, as shown by non-US studies. Still, there are ethical concerns about running these field experiments because of their potential impact on elections themselves. As a result, researchers collaborate with advocacy groups to conduct field experiments that use partisan messaging (see Chapter 11 in this volume on partnering with groups to run field experiments). In spite of that, lab-in-the-field experiments could examine the behavior of subjects aligned to certain political parties and minimize the impact of canvassing. Finally, the obvious reason to run lab-inthe-field and field experiments together is that both tend to be expensive, and implementing them together is a way to minimize costs and gather more information.


5.4 Dimensions of Running Experiments in Field Settings

In this section, we describe applications and analyze several dimensions of running lab-in-the-field experiments. We also discuss how certain research questions require different adaptations of a dimension.

5.4.1 Pre-Lab-in-the-Field

The first step is to visit the field and develop qualitative research to learn what is untamed in the field before developing the recruitment strategy and implementing the study (Paluck 2010; also see Chapter 20 in this volume on using qualitative data to supplement experiments). In fact, we advise running a pilot test of the design in the field with a small sample of the target population. The ideal is to obtain feedback about the design from these participants through focus groups or individual interviews. The feedback activities should be oriented toward evaluating several dimensions of how participants construe the design (see Paluck and Shafir 2017 for an overview of the psychology of construal in experiments). First, it is important to test whether participants perceive the design as the experimenter intends (see Chapter 12 in this volume for an overview of manipulation checks). Second, different subjects might not have the same mental representation of an experimental task, and the researcher should guarantee that this is not the case. Third, the experimenters should comprehend the underlying drivers of the main behavior they want to analyze, as this helps with conceptualizing the right experimental control and treatments. Fourth, researchers should also understand subjects’ mental representations of the survey questions at this stage. Fifth, a researcher should implement a modified final pilot that includes changes and corrections. This modified pilot should include several manipulation checks that control for the first four dimensions. In summary, running a pilot with students is a necessary but not sufficient step toward implementing a lab-in-the-field experiment; researchers must also run pilots with the selected population.17


In this part of the process, it is also important to confirm that the selected population presents sufficient variation in the attributes of interest or in exposure to a prior “treatment” that cannot be manipulated in the lab and may not be present in a typical lab experiment. As we mentioned above, Mironova and Whitt (2014) show that the location of Serbs (i.e., the attribute in their study) presents high variation, which is critical for their lab-in-the-field statistical inference. Moya (2018) engages in the difficult recruitment of internally displaced persons who have experienced different levels of atrocities in the Colombian conflict, guaranteeing high variation in exposure to violence (the recruitment is described in Section 5.2.3). Additionally, it is wise to explore ethical issues that could arise with the experimental design or its application and to approach the corresponding ethical authorities in the field. For example, some parts of a study could distress participants who have been exposed to violence. In this scenario, researchers should offer participants several options to opt out of the study, or at least part of it, and check during the pilots whether these preventive measures are sufficient.

17 Notice that we assume that there is no concern about limited subjects and resources. If subjects or resources are limited, we recommend using canonical games without modifications.

5.4.2 Permissions

The general rule in this scenario is to make sure that everyone who needs to approve the work is on board. As many lab-in-the-field studies happen in emerging economies, once the researcher obtains approval from the ethical committee at his or her university, there could be a temptation to skip the local ethical authorities where the study is taking place. The arguments behind this decision could be that there are no ethical committees that can review and approve a project, or that the process for obtaining permission is too complex.



We urge authors to proceed with caution under these circumstances. There are three main reasons for this. First, a study could unintentionally cross an ethical line when the researchers do not have enough knowledge or information about the social norms in a country, and a university ethics board might not be aware of these differences across the world either. Second, the field team could be exposed to dangerous situations by crossing unknown boundaries or violating social norms (see Chapter 7 in this volume for a general discussion of an ethical approach and Desposato 2015 for another general discussion). Third, the field situation may be volatile, and circumstances may change in a way that makes aspects of the experiment riskier for the participants or the researchers (see Footnote 13). One way to address this issue is to contact native scholars from the country where the study is happening and ask them who the best local authorities to contact would be. Another approach is to gather information from scholars who have experience doing fieldwork in the corresponding country. Finally, one should consider not just expected effects, but worst-case scenarios! For example, Callen et al. (2014) estimate the impact of violence on the risk preferences of Afghan subjects. However, a risk elicitation task resembles the act of gambling, which could be perceived as unethical by Muslim subjects, and unwillingness to gamble for religious reasons will distort measured risk preferences. Given this information, the authors decided to run this lab-in-the-field experiment in less conservative areas of Afghanistan.

5.4.3 Recruitment

One option is to recruit a random sample of subjects from a given population. In this scenario, researchers randomly assign experimental treatments within the recruited sample.

That said, as we mentioned in the previous section, the effects of experimental treatments often differ across covariates (i.e., heterogeneous treatment effects). In order to test for these heterogeneous effects, it is appealing to have a sample that is balanced with respect to the corresponding covariates. If there is an educated guess about heterogeneous effects along some dimension, simple random assignment within the sample might not be the best strategy, because the covariate could end up unbalanced across treatment conditions. In this scenario, it is better to implement a stratified randomized experiment.18 For example, De Arcangelis et al. (2015) run a lab-in-the-field experiment with Filipino migrants in Italy. The subsamples in this study are unbalanced because more women than men received the treatment, and the authors are not able to estimate with statistical confidence whether the treatment effects differ for males and females. Following Athey and Imbens (2017), a stratified randomized experiment would have prevented this issue and would have been a better strategy than simple random assignment.

18 Most political scientists refer to this as “blocking.”
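As an illustration of the stratified (block) randomization just described, here is a minimal sketch assuming a simple two-arm design; the roster and the blocking variable are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2024)

def block_random_assign(df, block_col, n_arms=2):
    """Shuffle units within each block and deal them into arms in round-robin
    order, so every arm stays (nearly) balanced on the blocking variable."""
    assignment = pd.Series(-1, index=df.index)
    for _, idx in df.groupby(block_col).groups.items():
        idx = rng.permutation(np.asarray(list(idx)))
        for arm in range(n_arms):
            assignment.loc[idx[arm::n_arms]] = arm
    return assignment

# Hypothetical recruitment roster, blocked on gender
roster = pd.DataFrame({"gender": ["F", "F", "F", "F", "M", "M", "M", "F", "M"]})
roster["arm"] = block_random_assign(roster, "gender")
print(roster.groupby(["gender", "arm"]).size())  # arms balanced within blocks
```

The same logic extends to several blocking covariates by building the blocks from their combinations.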


That said, sometimes researchers implement a stratified randomized sample, but during the recruitment process several subjects do not want to participate in the experiments. This case is common when the population does not trust the researchers (e.g., undocumented migrants). There are several solutions in this scenario. One solution is to implement a snowball sample. In this type of sample, subjects refer friends and acquaintances to the study and receive a fee for every subject they bring to the study. The drawback of this strategy is that researchers need several iterations of the snowball to approach a random sample, which costs more time and money, and care must be taken with statistical inference. A second solution is to get involved with the targeted community several months in advance until potential subjects trust you. For example, a researcher could connect with community leaders who serve the target population and seek their help in this process. This solution has some drawbacks that will be discussed below in the subsection on community involvement. It is also pertinent to have representative sociodemographic information about the area where the recruiting is happening. This information allows you to show in the final manuscript how closely the sample resembles a representative sample of the area.

5.4.4 Safety

Lab-in-the-field projects may entail safety issues and ethical concerns. A potential scenario is when the targeted subjects live in areas with high levels of violence and crime. Such lab-in-the-field studies put their teams in danger and as such could cross blurry ethical lines. Researchers should not lead recruitment teams into areas where their lives will be in danger. The sampling strategy should consider different random initial points of recruitment such that the team stays away from well-known dangerous streets and the presence of gangs, among other risks. In addition, researchers should develop clever strategies for carrying money in the field and paying subjects without exposing anyone, neither field team members nor subjects. Bodyguards may be needed. For example, the lab-in-the-field study in Moya (2018) faced several safety difficulties in Colombia because the dynamics of the Colombian conflict intensified during the implementation of the study in 2011. First, the original plan was to recruit victims of violence in three departmental capitals and six municipalities. However, the author was advised several times by local government officials and the ombudsmen to abandon some of the capitals and municipalities. As a result, the author could run experiments in only two capitals and two municipalities. Second, the author had to minimize the time he spent in each place (i.e., one week), as he received “warnings” on a daily basis in each community, and word about him carrying a substantial amount of money spread fast.


Third, the author contacted and hired local community leaders as enumerators because the participants were afraid to share their experiences, and it was also important to minimize the exposure of the enumerators. Fourth, the author ran the sessions in local churches, as participants and enumerators considered these places safe environments. Fifth, local enumerators administered the victimization questionnaire in private to prevent leaking sensitive information about the participants. Sixth, the participants had to recall traumatic events during their participation in the study, and the author had to learn techniques to contain emotional crises not only among the participants, but also among the enumerators. Given his and our experience, we think that it is essential to provide psychological support to researchers and enumerators before, during, and after the fieldwork, as the nature of some studies can cause psychological damage to them (e.g., post-traumatic stress disorder). In summary, the main message we want to convey is that the value of a life does not compare to the value of a published manuscript. Be careful.

5.4.5 Research Assistants

All research assistants (i.e., enumerators) and field supervisors should be trained not only on the study procedures, but also on ethical concerns. After training, researchers should run several mock elicitations to guarantee that the research assistants and supervisors are following the instructions and the protocol. At the same time, it is important not to reveal the main purpose of the study during training, because enumerators can have a conscious or unconscious interest in leading subjects to answer one way or another in the experiment. Moreover, tracking and monitoring enumerators can help ensure that they do not wander from the specified protocol (as they may tend to do!). In addition, it is important to track which enumerators interact with which subjects to estimate experimenter effects.
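The following minimal sketch shows one way such tracking can be used: regress the outcome on the treatment indicator together with enumerator indicators and test the indicators jointly. The data are synthetic and all variable names are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)

# Synthetic stand-in: outcome 'giving', treatment indicator 'treated', and
# 'enumerator', the research assistant who ran each session.
n = 300
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "enumerator": rng.choice(["A", "B", "C", "D"], n),
})
df["giving"] = 10 + 2 * df["treated"] + rng.normal(0, 3, n)

base = smf.ols("giving ~ treated", data=df).fit()
with_enum = smf.ols("giving ~ treated + C(enumerator)", data=df).fit()

# A joint F-test of the enumerator indicators flags experimenter effects.
print(anova_lm(base, with_enum))
```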



5.4.6 Payoffs

In general, researchers control preferences in lab-in-the-field experiments by relying on the induced value theory proposed by Smith (1976). Induced value theory proposes that subjects should be paid in a way that guarantees three postulates: monotonicity, salience, and dominance. In terms of monotonicity,19 subjects should be paid with a reward medium such that they prefer more of it to less. In many lab-in-the-field experiments, researchers apply this postulate by paying their subjects in local currency unless its purchasing power is unstable (i.e., under high inflation). Thus, a researcher should investigate the concept of wealth in the field, because a “local currency” or a “stable currency” might not be the preferred reward medium. For example, in one of our studies, a researcher assumed that in a high-inflation country (i.e., Venezuela in 2016) it was better to pay subjects in US dollars. However, the researcher eventually realized that subjects in this country find the currency of the neighboring country more useful. There are two reasons behind this strong preference. First, these subjects usually travel to the neighboring country to buy food and household supplies. Second, local supermarkets in the other country (i.e., Colombia) do not accept US dollars because counterfeit US dollar bills are in circulation. In terms of salience, the instructions within the experiment should be extremely clear, such that subjects understand how their actions in the lab-in-the-field experiment translate into monetary payoffs. In terms of dominance, subjects should be able to receive amounts of money that are substantial enough to dominate all other considerations that they might bring to the experiment. Thus, the majority of researchers in lab-in-the-field studies adjust average earnings to one or two days’ wages for unskilled local workers and make payments privately (as individuals could care about others’ earnings).

19 By monotonicity, we mean that, given a choice between two alternatives where one has a higher monetary reward, the subject will always choose the higher-payoff option. The subject always prefers more to less of the reward medium and does not become satiated.

In addition, researchers usually adjust the show-up fee to remove any possibility of losses during the experiments, as subjects could change their behavior dramatically due to loss aversion (see Kahneman et al. 1991). Some might judge lab-in-the-field studies as less ethical because their participants receive large payments. This would be true if lab-in-the-field researchers were using the payments coercively to recruit subjects, but researchers are careful not to do that and, indeed, they design the studies so that those who get paid a lot can disguise it if they want to. For example, in one of our studies, we varied the dictator game endowment so that a partner or spouse would not be able to infer what proportion of a payment was shared with them in one of the comparative dictator games (Candelo et al. 2018a).

5.4.7 Literacy of the Population

The ideal rule is that researchers should minimize the cognitive load on their subjects during the study. As a result, language, instructions, protocols, and methods should be easy to comprehend. A researcher needs to make sure that everything in the experiment is understandable and not a psychological burden per se. Academics may forget how smart they are, and indeed, experience with undergraduate student subjects may distort their beliefs about comprehension in the field. In addition, potential comprehension problems should be addressed; this does not merely imply simplifying the instructions. As we mentioned above, salience is important for controlling preferences, and researchers should present the information in the experiment in such a way that participants’ understanding reflects exactly what the researcher wants them to understand about the connection between decisions and payoffs. Researchers should debrief subjects during the pilot to determine how the participants’ mental representations of the experiment might deviate from its intention. Then, researchers should be able to generate a new conceptualization of the experiment such that their subjects’ mental representations are systematic, identifiable, and correspond to the purpose of the study.


A general practice is to combine the experimental design with a survey or systematic debriefing (by a native speaker of the relevant language) to measure miscomprehension, different mental representations of the experiment, and the external validity of the results of the experiment (i.e., questions asking about typical individual behaviors associated with the behavior elicited by the experiment). The researcher may want to include tasks that explicitly test for an understanding of the payment scheme and of dominance. An alternative methodology is to use pedagogical techniques to teach the games to the subjects. Experimenters could train subjects in the main concepts of the experiment. Considerable creativity is often involved in communicating the game to the subjects. For example, Paler (2013) evaluates how district windfalls, taxes, and information impact citizens’ political actions in an agricultural low-income district of Indonesia using a lab-in-the-field study. For this type of analysis, subjects should be carefully trained to consider the difference between external government revenue (i.e., windfalls) and internal government revenue (i.e., taxes). In this case, the researchers train subjects in the concepts of budgets and windfalls through a board game. Before the experiment starts, subjects allocate their own show-up fees to different household expenses in the board game. After this exercise, the researcher initiates the experiment by introducing the district government budget, windfalls, and taxes in the same board game.

5.4.8 One-on-One Experiments vs. Group Experiments

How subjects interact with other subjects while making decisions matters during implementation and shapes the statistical inference stage. Researchers may prefer one-on-one experiments over group experiments for several reasons. First, one-on-one experiments minimize the influence of bystanders and other participants, whose behavior could impact the analyzed individual outcomes.


Second, the experimenter can also track more directly the participant’s understanding of the experimental task. Third, during statistical inference, a simple comparison of average treatment effects will suffice (see Athey and Imbens 2017 for more details). Fourth, for experiments that require group interaction, researchers have solved this problem by matching current participants with pre-recorded actions of previous participants. This methodology is also used to prevent scenarios in which social conflict or social exclusion could arise among participants (see Enos and Gidron 2018 for an example in Israel). Moreover, caution should be taken when researchers implement group sessions with experiments that do not require group interactions. We understand that this option is cheaper than one-on-one sessions, but bystanders could generate several effects. First, subjects tend not to ask questions during group sessions because they fear embarrassment. The risk is that subjects end up making decisions in an environment that they do not fully comprehend. The use of post-surveys can help to identify this issue. Second, other participants who raise questions or comments could anchor other subjects’ decisions during a session. Hence, researchers must keep clear and extensive records of the events during a group session, as these are vital for understanding the data. One-on-one experiments should not be used when the group experience is essential for the research question (e.g., deliberation experiments as in Price and Cappella 2007; social network experiments as in Attanasio et al. 2012). In this scenario, there could be diffusion effects from other participants in the group, but controlling for and tracking these effects is complicated in a statistical analysis at the individual level. As a result, a transparent approach to statistical inference in this scenario is to analyze the data at the group/cluster level instead of at the individual level; a specific treatment should then be applied to a whole cluster. Researchers interested in individual-level analysis could use linear regression with weights based on cluster size (see Athey and Imbens 2017), as in the sketch below.
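A minimal sketch of this cluster-level approach, using synthetic data and hypothetical variable names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic subject-level data: 'cluster' is the session/community, and the
# treatment was assigned at the cluster level ('treated').
clusters = pd.DataFrame({"cluster": range(20),
                         "treated": rng.integers(0, 2, 20)})
df = clusters.loc[clusters.index.repeat(rng.integers(8, 15, 20))].reset_index(drop=True)
df["outcome"] = 5 + 1.5 * df["treated"] + rng.normal(0, 2, len(df))

# Collapse to cluster means; inference then treats clusters, not subjects,
# as the independent units.
cm = df.groupby("cluster").agg(outcome=("outcome", "mean"),
                               treated=("treated", "first"),
                               n=("outcome", "size"))

diff_unweighted = (cm.loc[cm.treated == 1, "outcome"].mean()
                   - cm.loc[cm.treated == 0, "outcome"].mean())

# A cluster-size-weighted version, one option for recovering an
# individual-level average effect.
diff_weighted = (np.average(cm.loc[cm.treated == 1, "outcome"],
                            weights=cm.loc[cm.treated == 1, "n"])
                 - np.average(cm.loc[cm.treated == 0, "outcome"],
                              weights=cm.loc[cm.treated == 0, "n"]))
print(diff_unweighted, diff_weighted)
```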



5.4.9 Community Involvement

Researchers interested in individual outcomes under some type of group interaction could use surveys to collect as much information as possible to control for the diffusion effects that can occur. For example, Attanasio et al. (2012) analyze the impacts of preexisting social networks on the risk pooling of individual decisions in 70 Colombian rural communities using a lab-in-the-field experiment. The authors collect information about the social networks in each experimental session and disentangle the potential effects of social interactions on risk pooling depending on the identity of the network members (relatives and friends or acquaintances and strangers).20 Candelo et al. (2018b) follow an alternative approach and direct information along the social networks elicited from participants. The authors explore how information flows through transnational social networks influence individual risky investment decisions using a lab-in-the-field approach. Subjects (Mexican immigrants in the USA) complete a risky investment task and then are given the opportunity to send a message to a member of their social network, at a cost. They give contact information for network members, friends, and family members in Mexico to the researchers. The experimenters are the messengers: they travel to Mexico with the information and deliver it personally to the network members of the Mexican immigrants, depending on the US subjects’ messages. These individuals in Mexico then also complete the risky investment task and can use any message sent by the US participants in making their own parallel investment decisions. Thus, the researchers travel along the network connections themselves and are careful to track and control for the diffusion of information such that the “stable unit treatment value assumption” (SUTVA) is not violated.

20 This study does not violate the “stable unit treatment value assumption” (SUTVA). See Chapter 16 in this volume for an overview of SUTVA.

In fact, the authors analyze possible information leaks via phone calls or trips to Mexico made by the Mexican immigrants in the USA, but they do not find evidence of spillover effects outside the experimental design. In this study, the authors demonstrate how information coming from immigrants and social distance can affect the investments and decisions of those who stay in the home country.

5.4.10 Large-Scale Multitask Lab-in-the-Field Studies

Researchers could decide to implement other games in the field that test questions orthogonal to the main question. For example, de Oliveira et al. (2011, 2012), Candelo et al. (2017), and Li et al. (2017) are four companion studies that were launched within the same large-scale lab-in-the-field study. Each study differs from its companions in several respects: de Oliveira et al. (2011) is a field experiment identifying donor types in a low-income population; de Oliveira et al. (2012) analyzes the stability of social preferences in a low-income population; Candelo et al. (2017) demonstrates the impact of social exclusion on giving using a Hispanic sample; and Li et al. (2017) explores the impact of an identity prime on donations to charities. All of these studies address distinct public goods issues. A general concern in such an overall study is spillover effects across the companion studies. Researchers should structure the study to minimize any potential spillovers. For example, in the large study described above, all subjects perform all of the tasks in the same order, and subjects only receive feedback on the one decision selected for payment at the very end of the experiment. That is, they receive no feedback during the experiment itself. Most decisions are individual decisions. Choices that are not individual use the strategy method, so that subjects are never responding to a specific action on the part of a counterpart. Therefore, all subjects have the same experience prior to each game, minimizing the potential impact of prior game experience.
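A minimal sketch of the pay-one-decision-at-the-end logic (task names and payoffs are hypothetical):

```python
import random

# A subject's decisions across the companion tasks, recorded during the session.
decisions = {"dictator": 30, "public_good": 15, "risk_task": 22, "trust": 18}

# Only one task is drawn for payment, and the draw is revealed only at the very
# end, so no decision generates feedback that could spill over into later tasks.
paid_task = random.choice(list(decisions))
print(f"Task paid out: {paid_task}, earnings: {decisions[paid_task]}")
```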


5.4.11 Longitudinal Surveys

Identifying the stability of experimental measures across time is crucial in this field. The ideal is to find that elicited measures are stable across time; otherwise, researchers should be able to identify events that impact these measures. Lab-in-the-field researchers can also augment their studies by pairing their efforts with national longitudinal surveys, which makes keeping up with a sample for several years more feasible than doing so on their own. This also allows the researchers to link to a rich archive of data on participants. Not only is following a sample in the long run interesting, but analyzing a representative sample is also statistically convenient. As a side note, caution should be taken to guarantee that the researcher has the capacity and the permission to link experimental data with the survey data not only once, but also in the future. Given our experience, we advise researchers to establish a written legal agreement with the institution in charge of the longitudinal survey before going into the field, to avoid disappointment due to personnel changes or changes in the legal environment. This is not the time to rely on trust. For example, Dohmen et al. (2011) are able to link lab-in-the-field experimental data with national survey data in Germany.

5.4.12 Power Estimations

Estimating the required sample size before going into the field establishes the power of a study. The main idea is to estimate, before starting an experiment, the sample size N needed to detect a given treatment effect. First, a researcher should identify in the previous literature a standard deviation of the outcome of interest. Second, a researcher should estimate the sample size N needed to detect a treatment effect of a given fraction of that standard deviation at a chosen significance level and power (see Athey and Imbens 2017 for such an approach).
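A minimal sketch of such a calculation, using the standard normal-approximation formula for a two-arm comparison of means; the standard deviation below is a hypothetical placeholder standing in for a value taken from the literature.

```python
from scipy.stats import norm

def n_per_arm(sd, mde, alpha=0.05, power=0.80):
    """Sample size per arm to detect a difference in means of size `mde`,
    two-sided test, equal-sized arms, normal approximation."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * ((z_alpha + z_power) * sd / mde) ** 2

sd = 0.28                       # hypothetical SD of donations from prior studies
print(n_per_arm(sd, 0.5 * sd))  # roughly 63 subjects per arm to detect half an SD
```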


For example, as mentioned previously, Candelo et al. (2018a) analyze how social distance affects donations using a lab-in-the-field experiment. In order to estimate the power of the study, the authors take a standard deviation of donations from a meta-analysis of dictator games. Then, the authors show that their sample of 1106 subjects detects half of that standard deviation with at least 99% power (many studies aim for 80% power).

5.4.13 Budget

Lab-in-the-field experiments clearly need funding, and an estimated budget will be required by the majority of grant-funding institutions (e.g., the National Science Foundation). Thus, once the experimental design is ready, we advise researchers to estimate the total funds that will be used for subject payments, research supplies, travel, and indirect costs. In terms of subject payments, the researcher must estimate the required number of subjects with a power calculation. In addition, researchers should estimate the average earnings per subject to compute the subject payments. Between-subject designs tend to be more expensive than within-subject designs, as the former require a separate subject for each independent observation. In terms of research supplies, it is important to estimate the funds that will be used for hiring personnel (e.g., enumerators), printing, materials, and other project expenses. In terms of travel, the researcher must keep in mind lodging, transportation, and meal expenses. In terms of indirect costs, it is always wise to inquire, before submitting a proposal, what the overhead rate is at the institution that will receive the funds. Pay attention to relevant insurance.
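A back-of-the-envelope sketch of the budget arithmetic; every figure below is a hypothetical placeholder to be replaced with local rates and the study's own power calculation.

```python
# Subject payments
n_subjects    = 2 * 130        # from the power calculation, both arms
avg_earnings  = 12.0           # roughly one day's wage for unskilled local workers
show_up_fee   = 4.0
subject_costs = n_subjects * (avg_earnings + show_up_fee)

# Research supplies and personnel
enumerator_costs = 6 * 25.0 * 20   # enumerators x daily rate x field days
supplies         = 800.0

# Travel (lodging, transportation, meals)
travel = 3500.0

direct_costs  = subject_costs + enumerator_costs + supplies + travel
overhead_rate = 0.15               # ask the receiving institution before submitting
total_budget  = direct_costs * (1 + overhead_rate)
print(round(total_budget, 2))
```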



5.4.14 One-Context vs. Cultural-Comparison Studies

We classify lab-in-the-field experiments into one-context studies and cultural-comparison studies. One-context studies use a single population with the attributes that the research question requires. These attributes present considerable variation within the selected population. Still, some researchers interested in the effect of social or cultural norms might need more than a one-context study, as inducing these norms in the lab is not always easy. In this case, researchers implement cultural-comparison studies using several populations that present considerable variation in the required attributes (see Chapter 22 in this volume for a general discussion of subtypes of cultural-comparison experiments). Thus, we define cross-national survey experiments as a subset of lab-in-the-field studies only if it is essential for these studies to recruit participants in different nations. Our closing message is that many research questions require features in the population that are impossible to simulate in the lab, and as a result, it is indispensable to implement lab-in-the-field studies.

5.5 Final Discussion and Potential Warnings

Our chapter complements the previous literature on the classification of laboratory experiments with nonstudent populations (Gneezy et al. 2017; Harrison et al. 2004) by specifying a general rule that allows researchers to distinguish and classify a certain group of studies as lab-in-the-field experiments. We strongly suggest that researchers avoid the subjective practice of running experiments with nonstudent populations simply because the population is considered relevant per se. An important message we want to convey is that a population outside the lab is not inherently more interesting or more relevant than one in the lab. Furthermore, it may be the experience rather than the identity of the population that makes it appropriate for addressing a particular research question. We assert that the question leads the researcher to implement this type of research because the attributes of the population or the context are critical for addressing the problem.

Our chapter also highlights the importance of pairing lab-in-the-field experiments with lab and field experiments. Lab experiments refine lab-in-the-field designs during the pilot stages. Field experiments are not able to separate the effects of certain individual preferences that companion lab-in-the-field interventions could help to uncover. Our chapter also discusses extensively the main dimensions of implementing lab-in-the-field studies that researchers should pay attention to. In general, we discuss possible data analysis problems and how to anticipate and avoid them. We make several recommendations and discuss good practices for each dimension. We also stress the disadvantages of some practices, and we believe that future studies should try to innovate and overcome these obstacles. Our general recommendation is to push toward not only debriefing policymakers and executive authorities, but also giving feedback to the community. Social scientists have some tendency to disengage from the community once their final manuscripts are published. That said, we do not think that all information is useful to share with a community, but a concise message that the community can use to improve socioeconomic outcomes can be very important (see Cárdenas and Ostrom 2004). The last potential warning we want to discuss is that final manuscripts always give the impression of being solely the work of the main coauthors. Many lab-in-the-field studies are possible only through the joint effort of a team of individuals (i.e., coauthors, experimental leaders, research assistants, enumerators, and participants) who end up being summarized only in the acknowledgments. In fact, many lessons from this chapter are the result of shared experiences of a substantial number of individuals who have worked with us in the field for the last 16 years. Thus, we believe that the presence of a main coauthor/deputy in the field is essential to understanding your population and how it interacts with your research team. Moreover, we suggest that this main coauthor/deputy could evaluate the input of each worker in the field such that final published manuscripts could begin to include more information about the contributions of the research team members.


This can be very encouraging for research assistants and might help them down the road. People say that entertainment sometimes does a better job of capturing attention than science, and maybe that reputation is well deserved. However, all movies give credits at the end. Shouldn’t we start doing the same?

References

Acosta, Luis. 2019. “Murder of Hundreds of Colombian Activists Casts Shadow Over Peace Process.” Reuters, August 25. Angrist, Joshua D., Guido W. Imbens, and Donald B. Rubin. 1996. “Identification of Causal Effects Using Instrumental Variables.” Journal of the American Statistical Association 91(434): 444–455. Athey, Susan, and Guido W. Imbens. 2017. “The Econometrics of Randomized Experiments.” In Handbook of Economic Field Experiments, eds. A. Banerjee and E. Duflo. Amsterdam: Elsevier, pp. 73–140. Attanasio, Orazio, Abigail Barr, Juan Camilo Cárdenas, Garance Genicot, and Costas Meghir. 2012. “Risk Pooling, Risk Preferences, and Social Networks.” American Economic Journal: Applied Economics 4(2): 134–167. Banerjee, Abhijit V., and Esther Duflo. 2009. “The Experimental Approach to Development Economics.” Annual Review of Economics 1(1): 151–178. Banuri, Sheheryar, and Catherine Eckel. 2012. “Experiments in Culture and Corruption: A Review.” In New Advances in Experimental Research on Corruption, eds. A. Barr and D. Serra. Bingley: Emerald Group Publishing Limited, pp. 51–76. Banuri, Sheheryar, and Catherine Eckel. 2015. “Cracking Down on Bribery.” Social Choice and Welfare 45(3): 579–600. Barr, Abigail, and Danila Serra. 2010. “Corruption and Culture: An Experimental Analysis.” Journal of Public Economics 94(11–12): 862–869. Bellemare, Charles, and Sabine Kröger. 2007. “On Representative Social Capital.” European Economic Review 51(1): 183–202.


Bernal, Adriana, Juan-Camilo Cárdenas, Laia Domenech, Ruth Meinzen-Dick, and J. Sarmiento Paula. 2016. “Social Learning through Economic Games in the Field.” Mimeo. Universidad de Los Andes. Buchan, Nancy R., Eric J. Johnson, and Rachel T. A. Croson. 2006. “Let’s Get Personal: An International Examination of The Influence of Communication, Culture and Social Distance on Other Regarding Preferences.” Journal of Economic Behavior & Organization 60(3): 373–398. Bursztyn, Leonardo, Davide Cantoni, David Y. Yang, Noam Yuchtman, and Y. Jane Zhang. 2019. “Persistent Political Engagement: Social Interactions and the Dynamics of Protest Movements.” Working Paper. University of Chicago. Callen, Michael, Mohammad Isaqzadeh, James D. Long, and Charles Sprenger. 2014. “Violence and Risk Preference: Experimental Evidence from Afghanistan.” American Economic Review 104(1): 123–148. Cameron, Lisa A. 1999. “Raising the Stakes in the Ultimatum Game: Experimental Evidence from Indonesia.” Economic Inquiry 37(1): 47–59. Candelo, Natalia, Rachel T. A. Croson, and Sherry Xin Li. 2017. “Identity and Social Exclusion: An Experiment with Hispanic Immigrants in the US.” Experimental Economics 20(2): 460–480. Candelo, Natalia, Catherine Eckel, and Cathleen Johnson. 2018a. “Social Distance Matters in Dictator Games: Evidence from 11 Mexican Villages.” Games 9(4): 77. Candelo, Natalia, Rachel T. A. Croson, and Catherine Eckel. 2018b. “Transmission of Information within Transnational Social Networks: A Field Experiment.” Experimental Economics 21(4): 905–923. Candelo, Natalia, Angela C. M. de Oliveira, and Catherine Eckel. 2019. “Worthiness Versus Self-Interest in Charitable Giving: Evidence from a Low-Income, Minority Neighborhood.” Southern Economic Journal 85(4): 1196–1216. Cárdenas, Juan Camilo. 2018. “(Real) Behavior Meets (Real) Institutions: Towards a Research Agenda on the Study of the Commons.” In A Research Agenda for New Institutional Economics, eds. Claude Ménard and Mary M. Shirley. Cheltenham: Edward Elgar Publishing, pp. 119–126.



Cárdenas, Juan-Camilo, and Elinor Ostrom. 2004. “What Do People Bring into the Game? Experiments in the Field about Cooperation in the Commons.” Agricultural Systems 82(3): 307–326. Cárdenas, Juan Camilo, and Jeffrey Carpenter. 2008. “Behavioural Development Economics: Lessons from Field Labs in the Developing World.” Journal of Development Studies 44(3): 311–338. Cárdenas, Juan Camilo, and Jeffrey Carpenter. 2013. “Risk Attitudes and Economic WellBeing in Latin America.” Journal of Development Economics 103: 52–61. Charness, Gary, Uri Gneezy, and Michael A. Kuhn. 2013. “Experimental Methods: Extra-Laboratory Experiments-Extending the Reach of Experimental Economics.” Journal of Economic Behavior & Organization 91: 93–100. Condra, Luke N., Mohammad Isaqzadeh, and Sera Linardi. 2017. “Clerics and Scriptures: Experimentally Disentangling the Influence of Religious Authority in Afghanistan.” British Journal of Political Science 49(2): 401–419. Coppock, Alexander, and Donald P. Green. 2015. “Assessing the Correspondence between Experimental Results Obtained in the Lab and Field: A Review of Recent Social Science Research.” Political Science Research and Methods 3(1): 113–131. Dave, Chetan, Catherine C. Eckel, Cathleen A. Johnson, and Christian Rojas. 2010. “Eliciting Risk Preferences: When Is Simple Better?” Journal of Risk and Uncertainty 41(3): 219–243. De Arcangelis, Giuseppe, Majlinda Joxhe, David McKenzie, Erwin Tiongson, and Dean Yang. 2015. “Directing Remittances to Education with Soft and Hard Commitments: Evidence from a Lab-In-The-Field Experiment and New Product Take-Up among Filipino Migrants in Rome.” Journal of Economic Behavior & Organization 111: 197–208. De Oliveira, Angela C. M., Rachel T. A. Croson, and Catherine Eckel. 2011. “The Giving Type: Identifying Donors.” Journal of Public Economics 95(5–6): 428–435. De Oliveira, Angela C. M., Catherine Eckel, and Rachel T. A. Croson. 2012. “The Stability of Social Preferences in a Low-Income Neighborhood.” Southern Economic Journal 79(1): 15–45. Deaton, Angus. 2010. “Instruments, Randomization, and Learning about Development.” Journal of Economic Literature 48(2): 424–455.

Dohmen, Thomas, Armin Falk, David Huffman, and Uwe Sunde. 2011. “The Intergenerational Transmission of Risk and Trust Attitudes.” Review of Economic Studies 79(2): 645–677. Desposato, Scott, ed. 2015. Ethics and Experiments: Problems and Solutions for Social Scientists and Policy Professionals. New York: Routledge. Eckel, Catherine C., and Philip J. Grossman. 2008. “Forecasting Risk Attitudes: An Experimental Study Using Actual and Forecast Gamble Choices.” Journal of Economic Behavior & Organization 68(1): 1–17. Eckel, Catherine. 2014. “Economic Games for Social Scientists.” In Laboratory Experiments in the Social Sciences, eds. M. Webster and J. Sell. Oxford: Elsevier, pp. 335–355. Eckel, Catherine, Benjamin Priday, and Rick Wilson. 2018. “Charity Begins at Home: A Lab-in-the-Field Experiment on Charitable Giving.” Games 9(4): 95. Enos, Ryan D., and Noam Gidron. 2018. “Exclusion and Cooperation in Diverse Societies: Experimental Evidence from Israel.” American Political Science Review 112(4): 742–757. Fehr, Ernst, Urs Fischbacher, Bernhard Von Rosenbladt, Jürgen Schupp, and Gert G. Wagner. 2003. “A Nation-Wide Laboratory: Examining Trust and Trustworthiness by Integrating Behavioral Experiments into Representative Survey.” Working Paper. Social Science Research Network. Fréchette, Guillaume R. 2016. “Experimental Economics across Subject Populations.” The Handbook of Experimental Economics, eds. J. Kagel and A. Roth. Princeton, NJ: Princeton University Press, pp. 435–480. Garner, Joel, Jeffrey Fagan, and Christopher Maxwell. 1995. “Published Findings from the Spouse Assault Replication Program: A Critical Review.” Journal of Quantitative Criminology 11(1): 3–28. Gerber, Alan S., and Donald P. Green. 2017. “Field Experiments on Voter Mobilization: An Overview of a Burgeoning Literature.” In Handbook of Economic Field Experiments, eds. A. Banerjee and E. Duflo. Amsterdam: Elsevier, pp. 395–438. Gilligan, Michael J., Benjamin J. Pasquale, and Cyrus Samii. 2014. “Civil War and Social Cohesion: Lab-in-the-Field Evidence from Nepal.” American Journal of Political Science 58(3): 604–619.

Gneezy, Uri, and Alex Imas. 2017. “Lab in the Field: Measuring Preferences in the Wild.” In Handbook of Economic Field Experiments, eds. A. Banerjee and E. Duflo. Amsterdam: Elsevier, pp. 439–464. Grossman, Guy, and Delia Baldassarri. 2012. “The Impact of Elections on Cooperation: Evidence from a Lab-In-The-Field Experiment in Uganda.” American Journal of Political Science 56(4): 964–985. Harrison, Glenn W., and John A. List. 2004. “Field Experiments.” Journal of Economic Literature 42(4): 1009–1055. Henrich, Joseph. 2000. “Does Culture Matter in Economic Behavior? Ultimatum Game Bargaining Among the Machiguenga of the Peruvian Amazon.” American Economic Review 90(4): 973–979. Henrich, Joseph, Robert Boyd, Samuel Bowles, Colin Camerer, Ernst Fehr, Herbert Gintis, and Richard McElreath. 2001. “Cooperation, Reciprocity and Punishment in Fifteen Small-Scale Societies.” American Economic Review 91(2): 73–78. Henrich, Joseph Patrick, Robert Boyd, Samuel Bowles, Ernst Fehr, Colin Camerer, and Herbert Gintis, eds. 2004. Foundations of Human Sociality: Economic Experiments and Ethnographic Evidence from Fifteen Small-Scale Societies. Oxford: Oxford University Press on Demand. Holland, Paul W. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association 81(396): 945–960. Iyengar, Radha. 2010. “Does Arrest Deter Violence? Comparing Experimental and Nonexperimental Evidence on Mandatory Arrest Laws.” In The Economics of Crime: Lessons For and From Latin America, eds. R. Di Tella, S. Edwards, and E. Schargrodsky. Chicago, IL: University of Chicago Press, pp. 421–452. Kachelmeier, Steven J., and Mohamed Shehata. 1997. “Internal Auditing and Voluntary Cooperation in Firms: A Cross-Cultural Experiment.” Accounting Review 72(3): 407–431. Kahneman, Daniel, Jack L. Knetsch, and Richard H. Thaler. 1991. “Anomalies: The Endowment Effect, Loss Aversion, and Status Quo Bias.” Journal of Economic Perspectives 5(1): 193–206. Kim, Eunji. 2019. “Entertaining Beliefs in Economic Mobility.” PhD dissertation, University of Pennsylvania.


Li, Sherry Xin, Angela C. M. de Oliveira, and Catherine Eckel. 2017. “Common Identity and The Voluntary Provision of Public Goods: An Experimental Investigation.” Journal of Economic Behavior & Organization 142: 32–46. Liu, Elaine M. 2013. “Time to Change What to Sow: Risk Preferences and Technology Adoption Decisions of Cotton Farmers in China.” Review of Economics and Statistics 95(4): 1386–1403. Moya, Andrés. 2018. “Violence, Psychological Trauma, and Risk Attitudes: Evidence from Victims of Violence in Colombia.” Journal of Development Economics 131: 15–27. Mironova, Vera, and Sam Whitt. 2014. “Ethnicity and Altruism after Violence: The Contact Hypothesis in Kosovo.” Journal of Experimental Political Science 1(2): 170–180. Nielsen, Richard A. 2016. “Ethics for Experimental Manipulation of Religion.” In Ethics and Experiments, ed. Scott Desposato. Abingdon: Routledge, pp. 56–79. Ostrom, Elinor. 1990. Governing the Commons: The Evolution of Institutions for Collective Action. Cambridge, UK: Cambridge University Press. Paler, Laura. 2013. “Keeping the Public Purse: An Experiment in Windfalls, Taxes, and the Incentives to Restrain Government.” American Political Science Review 107(4): 706–725. Paluck, Elizabeth. 2010. “The Promising Integration of Qualitative Methods and Field Experiments.” Annals of the American Academy of Political and Social Science 628(1): 59–71. Paluck, Elizabeth Levy, and Eldar Shafir. 2017. “The Psychology of Construal in the Design of Field Experiments.” In Handbook of Economic Field Experiments, eds. A. Banerjee and E. Duflo. Amsterdam: Elsevier, pp. 245–268. Price, Vincent, and Joseph N. Cappella. 2007. “Healthcare Dialogue: Project Highlights.” In Proceedings of the 8th Annual International Conference on Digital Government Research: Bridging Disciplines & Domains. Los Angeles, CA: Digital Government Society of North America, pp. 278–279. Roe, Brian E., and David R. Just. 2009. “Internal and External Validity in Economics Research: Tradeoffs between Experiments, Field Experiments, Natural Experiments, and Field Data.” American Journal of Agricultural Economics 91(5): 1266–1271.



Smith, Vernon L. 1962. “An Experimental Study of Competitive Market Behavior.” Journal of Political Economy 70(2): 111–137. Smith, Vernon L. 1976. “Experimental Economics: Induced Value Theory.” American Economic Review 66(2): 274–279. Viceisza, Angelino. 2016. “Creating a Lab in the Field: Economics Experiments for Policymaking.” Journal of Economic Surveys 30(5): 835–854. Ward, Patrick S., and Vartika Singh. 2015. “Using Field Experiments to Elicit Risk and Ambiguity Preferences: Behavioural Factors and the Adoption of New Agricultural Technologies in Rural India.” Journal of Development Studies 51(6): 707–724.

CHAPTER 6

Natural Experiments∗

Rocío Titiunik

Abstract

The term “natural experiment” is used inconsistently. In one interpretation, it refers to an experiment where a treatment is randomly assigned by someone other than the researcher. In another interpretation, it refers to a study in which there is no controlled random assignment, but treatment is assigned by some external factor in a way that loosely resembles a randomized experiment – often described as an “as-if random” assignment. In yet another interpretation, it refers to any nonrandomized study that compares a treatment to a control group, without any specific requirements on how the treatment is assigned. I introduce an alternative definition that seeks to clarify the integral features of natural experiments and at the same time to distinguish them from randomized controlled experiments. I define a natural experiment as a research study where the treatment assignment mechanism (1) is neither designed nor implemented by the researcher, (2) is unknown to the researcher, and (3) is probabilistic by virtue of depending on an external factor. The main message of this definition is that the difference between a randomized controlled experiment and a natural experiment is not a matter of degree, but of essence, and thus conceptualizing a natural experiment as a research design akin to a randomized experiment is neither rigorous nor a useful guide to empirical analysis. Using my alternative definition, I discuss how a natural experiment differs from a traditional observational study and offer practical recommendations for researchers who wish to use natural experiments to study causal effects.

* I am grateful to Donald P. Green and James N. Druckman for their helpful feedback on multiple versions of this chapter, and to Marc Ratkovic and participants at the Advances in Experimental Political Science Conference held at Northwestern University in May 2019 for valuable comments and suggestions. I am also indebted to Alberto Abadie, Joshua Angrist, Matias Cattaneo, Angus Deaton, Guido Imbens, and Luke Keele for their insightful comments and criticisms, which not only improved the manuscript, but also gave me much to think about for the future. This chapter has also benefited from my collaborations on natural experiments and observational studies over the years with Jasjeet Sekhon.




The framework for the analysis and interpretation of randomized experiments is routinely employed to study interventions that are not experimentally assigned but nonetheless share some of the characteristics of randomized controlled trials. Research designs that study nonexperimental interventions invoking tools and concepts from the analysis of randomized experiments are sometimes referred to as natural experiments. However, the use of the term has been inconsistent both within and across disciplines. My first goal is to introduce a definition of natural experiment that identifies its integral features and distinguishes it clearly from a randomized experiment where treatments are assigned according to a known randomization procedure that results in full knowledge of the probability of occurrence of each possible treatment allocation. I call such an experiment a randomized controlled experiment to emphasize that the way in which the randomness is introduced is controlled by the researcher and thus results in a known probability distribution. One of the main messages of the new definition is that the difference between a randomized controlled experiment and a natural experiment is not a matter of degree, but of essence, and therefore conceptualizing a natural experiment as a research design that approximates or is akin to a randomized experiment is neither rigorous nor a useful guide to empirical analysis. I then consider the ways in which a natural experiment in the sense of the new definition differs from other kinds of nonexperimental or observational studies. The central conclusions of this discussion are that, compared to traditional observational studies where there is no external source of treatment assignment, natural experiments (1) have the advantage of more clearly separating pre- from posttreatment periods and thus allow for a more rigorous falsification of their assumptions and (2) can offer an objective (though not directly testable) justification for an unconfoundedness assumption.

My discussion is inspired and influenced by the conceptual distinctions introduced by Deaton (2010) in his critique of experimental and quasi-experimental methods in development economics (see also Deaton 2020) and is based on the potential outcomes framework developed by Neyman (1923[1990]) and Rubin (1974) – see also Holland (1986) for an influential review and Imbens and Rubin (2015) for a comprehensive textbook. The use of natural experiments in the social sciences was pioneered by labor economists around the early 1990s (e.g., Angrist 1990; Angrist and Krueger 1991; Card and Krueger 1994) and has been subsequently used by hundreds of scholars in multiple disciplines, including political science. My goal is not to give a comprehensive review of prior work based on natural experiments nor a historical overview of the use of natural experiments in the social sciences. For this, I refer the reader to Angrist and Krueger (2001), Craig et al. (2017), Dunning (2008, 2012), Meyer (1995), Petticrew (2005), Rosenzweig and Wolpin (2000), and the references therein. See also Abadie and Cattaneo (2018) for a recent review of program evaluation and causal inference methods.

6.1 Two Examples

I start by considering two empirical examples, both of which are described by their authors as natural experiments at least once in their manuscripts. The first example is the study by Lassen (2005), who examines the decentralization of city government in Copenhagen, Denmark. In 1996, the city was divided into 15 districts, and four of those districts were selected to introduce a local administration system for four years. The four treated districts were selected from among the 15 districts to be representative of the overall city in terms of various demographic and social characteristics. In 2000, a referendum was held in the entire city, giving voters the option to extend decentralization to all districts or eliminate it altogether. Lassen compares the referendum results of treated versus control districts to estimate the effect of information on voter turnout.


The assumption is that voters in the treated districts are better informed about decentralization than control voters, and the hypothesis tested is that uninformed voters are more likely to abstain from voting, which at the aggregate level should lead to an increase in voter turnout in the treated districts. Lassen (2005) considers the assignment of districts to the decentralization/no-decentralization conditions as “exogenously determined variation” (p. 104) in whether city districts have firsthand experience with decentralization. Lassen (2005) then uses decentralization as an instrument for information, but I focus exclusively on the “intention-to-treat” effect of decentralization on turnout.

The second example is Titiunik (2016), where I studied the effect of term length on legislative behavior in the state senates of Texas, Arkansas, and Illinois. In these states, state senators serve four-year terms that are staggered, with half of the seats up for election every two years. Senate districts are redrawn immediately after each decennial census to comply with the constitutionally mandated requirement that all districts have equal populations. But state constitutions also mandate that all state senate seats must be up for election immediately after reapportionment. In order to comply with this requirement and keep seats staggered, all seats are up for election in the first election after reapportionment, but the seats are randomly assigned to two term-length conditions: either serve two years immediately after the election (and then two consecutive four-year terms) or serve four years immediately after the election (and then one four-year term and another two-year term). Titiunik (2016) used the random assignment to two-year and four-year terms that occurred after the 2002 election under newly redrawn districts to study the effect of shorter terms on abstention rates and bill introduction during the 2002–2003 legislative session.

6.2 Two Common Definitions of a Natural Experiment

The two examples presented above share a standard program evaluation setup (e.g., Abadie and Cattaneo 2018; Imbens and Wooldridge 2009), where the researcher is interested in studying the effect of a binary treatment or intervention (decentralization, short term length) on an outcome of interest (voter turnout, abstention rates).


They also have in common that neither study was designed by the researcher: the rules that determined which city districts had temporary decentralized governments or which senators served two-year terms were decided, respectively, by the city government of Copenhagen and the state governments of Arkansas, Illinois, and Texas – not by the researchers who published the studies. In both cases, the researcher saw an opportunity in the allocations of these interventions to answer a question of long-standing scientific and policy interest.

Despite their similarities, the examples have one crucial difference. In the decentralization study by Lassen (2005), the assignment of districts to the decentralized/not-decentralized conditions was not determined by a physical randomization device, but rather by officials seeking to select treated districts that were representative of the city as a whole. In contrast, the assignment of senate seats to two-year or four-year terms was based on a fixed-margins randomization device that gave each senator the same probability of serving two- or four-year terms.1

1 See Titiunik (2016) for details on the assignments in each state. In Texas, for example, the 35 senate seats were allocated by creating 17 pieces of paper marked with “2” and 18 marked with “4,” mixing all pieces inside a bowl, and having each of the 35 elected senators draw one piece of paper without looking.
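To make the contrast concrete, here is a small simulation sketch of the Texas procedure described in footnote 1; it is an illustration of a fixed-margins randomization, not code from the original study.

```python
import numpy as np

rng = np.random.default_rng()

# 17 slips marked "2" and 18 marked "4" are mixed in a bowl and each of the
# 35 senators draws one: a fixed-margins randomization.
slips = np.array([2] * 17 + [4] * 18)
terms = rng.permutation(slips)          # one realization of the assignment

print((terms == 2).sum(), "senators drew an initial two-year term")

# Every senator has probability 17/35 of a two-year term, and every allocation
# with exactly 17 two-year terms is equally likely. This is a known assignment
# mechanism, unlike the Copenhagen selection of "representative" districts.
```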

106

Rocío Titiunik

source of such superior credibility, other than invoking an analogy between the "natural experimental" treatment assignment and the kind of treatment assignment that governs randomized controlled trials. There are two ways in which this analogy is made: one is literal, and the other is figurative, leading to two common but distinct definitions of a natural experiment. In the literal interpretation of Gerber and Green (2012), p. 15, a natural experiment is a situation in which there is random assignment of a treatment via a randomization device, but this assignment is not under the control of the researcher. According to this definition, the term length study in Titiunik (2016) is a natural experiment, but the decentralization study in Lassen (2005) is not. Other examples that conform to this definition of natural experiment include Erikson and Stoker (2011), who use the Vietnam draft lottery to study the effect of the military draft on political attitudes, and Bhavnani (2009), who uses a rule that randomly reserves one-third of seats for women candidates in India's local elections to study the impact of reservations on women's future electoral success.2

2 Gerber and Green (2012), p. 16, following Cook and Campbell (1979) and Cook et al. (2002), use the term quasi-experiment to refer to studies such as Lassen (2005) where no actual randomization device is employed.

This definition has the advantage of being precise. Understood as a randomized experiment controlled by an external party, a natural experiment can be analyzed by directly applying the standard tools from the analysis of randomized experiments. To be sure, the interpretation of the estimated parameter can still pose serious challenges when the groups that the randomization deems comparable are not directly informative about the parameter of scientific interest (Sekhon and Titiunik 2012).3

3 Sekhon and Titiunik (2012) consider various examples of natural experiments where this phenomenon occurs, including the study by Bhavnani (2009) cited above.

But interpretation issues aside, the assumptions and methods for estimation and inference under controlled randomization are well established. Because
this definition of a natural experiment is conceptually clear and its implementation relatively uncontroversial, it is not the focus of my discussion. Instead, my interest lies in another widely used definition that interprets a natural experiment as some sort of imperfect approximation to a randomized controlled experiment. According to this figurative definition, a natural experiment is a situation in which an external event introduces variation in the allocation of the treatment, and the researcher uses the external event as the basis to claim that the treatment is "as good as random" or "as-if random," but no physical randomization device is explicitly employed by any human being. Scholars who employ this notion of natural experiments do not typically offer a formal definition of "as-if randomness," but rather refer heuristically to an analogy or comparison with randomized experiments. Different versions of this analogy have been offered in political science, economics, public health, and other sciences. In political science, Dunning (2008) defines a natural experiment as a study in which the data come "from naturally occurring phenomena" (p. 282), where the treatment is not assigned randomly, but the researcher makes "a credible claim that the assignment of the nonexperimental subjects to treatment and control conditions is 'as if' random" (p. 283). In economics, Meyer (1995) defines a natural experiment as a study that investigates "outcome measures for observations in treatment groups and comparison groups that are not randomly assigned" (p. 151), and Angrist and Krueger (2001) define a natural experiment as a situation "where the forces of nature or government policy have conspired to produce an environment somewhat akin to a randomized experiment" (p. 73). In public health, Petticrew (2005) defines natural experiments in contrast to randomized experiments, as designs in which "the researcher cannot control or withhold the allocation of an intervention to particular areas or communities, but where natural or predetermined variation in allocation occurs" (p. 752).

This definition of a natural experiment, which I shall name the “as-if random” definition, seems to be more common among empirical researchers than the definition of Gerber and Green (2012). Most empirical researchers who invoke natural experiments refer to cases where a treatment is allocated by forces outside their control and not based on a randomization device.

6.3 Conceptual Distinctions

Given the widespread use of the as-if random interpretation of a natural experiment, my focus in this chapter is on research designs of this type. That is, I focus on research designs where there is no physical randomization device intentionally controlled by a human being with the purpose of allocating the treatment, but rather the treatment assignment is determined by an external factor. However, I depart from the as-if random definition of natural experiments and instead present a definition in which natural experiments are defined in contrast to randomized experiments as opposed to akin to them. My definition encompasses the spirit of the as-if random understanding of a natural experiment, but introduces a more rigorous treatment of the role of experimental manipulation and random assignment, drawing conceptual distinctions that have so far remained fused. My discussion builds most directly on prior arguments by Deaton (2010) and on several definitions discussed by Imbens and Rubin (2015). The case of experimentation without randomization is beyond the scope of my discussion, but is worth considering at least briefly. Loosely, an experiment is a study in which the researcher executes a controlled intervention over some process in order to test a hypothesis and/or explore potential mechanisms. An experimental intervention need not be randomly assigned, and indeed nonrandom experiments are ubiquitous in the natural sciences, where there is sufficient prior knowledge (such as established laws of physics) to plausibly create a controlled environment.

Given the meaning of the term experiment, the term natural experiment seems to be an oxymoron, since the adjective natural often refers to the researcher's lack of control over the treatment assignment. A randomized controlled experiment is thus a special case of an experiment, and the opposite of an experiment is an observational study (where the researcher is unable to intervene in or control the conditions). In my proposed usage, discussed at length below, a natural experiment is (oxymoronically) a special case of an observational study, not a special case of an experiment. Rather than changing established usage of these terms, in the following pages I seek to clarify the concepts that these terms refer to.

6.3.1 Randomized Experiments

The first step to arriving at a precise and encompassing definition of a natural experiment requires that we define the term randomized experiment with some precision. For this, I follow the Neyman–Rubin potential outcomes framework (Neyman 1923[1990]; Holland 1986; Rubin 1974) and introduce standard notation. I assume that the researcher studies a population of n units, indexed by i = 1, 2, . . . , n, and her goal is to analyze the effect of a binary intervention or treatment Z, with Zi = 1 if i is assigned to the treatment condition and Zi = 0 if i is assigned to the control. Each unit i has two potential outcomes corresponding to each one of the treatment conditions, with Yi(1) the outcome that i would attain under treatment and Yi(0) the outcome that i would attain under control. The observed outcome is Yi = Zi Yi(1) + (1 − Zi) Yi(0), and Xi is a vector of k covariates determined before the treatment is assigned (hereafter called predetermined covariates). The individual-level variables are collected in the n × 1 vectors Y(1), Y(0), and Z, and the n × k matrix X. This notation can be used to describe the two examples above. In the Lassen (2005) study, the units are city districts, Zi = 1 if a district's government is decentralized and Zi = 0 otherwise, and Yi is district-level voter
turnout. In the Titiunik (2016) study, the units are state senators, Zi = 1 if senator i serves a two-year term after redistricting and Zi = 0 if he or she serves a four-year term instead, and Yi is the abstention rate or number of bills introduced during the post-redistricting legislative session. The vector Z = z gives the particular arrangement of treated and control units that occurred. For example, if n = 5 and z = [1, 0, 0, 1, 1], units 1, 4, and 5 were assigned to the treatment group and units 2 and 3 were assigned to the control. I define the assignment mechanism Pr(Z|X, Y(1), Y(0)) as in Imbens and Rubin (2015). This function gives the probability of occurrence of each possible value of the treatment vector Z. It therefore takes values in the interval [0, 1] and satisfies ∑_{z ∈ {0,1}^n} Pr(z|X, Y(0), Y(1)) = 1 for all X, Y(0), Y(1). From this, we can define the unit-level assignment probability pi as the sum of the probabilities associated with all of the assignments that result in unit i receiving the treatment, pi(X, Y(0), Y(1)) = ∑_{z: zi = 1} Pr(z|X, Y(0), Y(1)). Imbens and Rubin (2015) define randomized experiments in terms of restrictions placed on the assignment mechanism. I restate some of these restrictions and their definition of randomized experiment, which I then use as the basis of my discussion. The first restriction I consider requires that every unit be assigned to treatment with probability strictly between zero and one. Formally, Pr(Z|X, Y(0), Y(1)) is a probabilistic assignment (Imbens and Rubin 2015, p. 38) if

0 < pi(X, Y(0), Y(1)) < 1 for every i, for each X, Y(0), Y(1).    (6.1)

An assignment is probabilistic when every unit has both a positive probability of being assigned to the treatment condition and a positive probability of being assigned to the control condition – in other words, when all units are "at risk" of being assigned to both conditions before the treatment is in fact assigned. Importantly for our purposes, a probabilistic assignment rules out deterministic situations where, conditional on X, Y(0),
and Y(1), units are assigned to one of the treatment conditions with certainty. Given these possible restrictions on the assignment mechanism, Imbens and Rubin (2015) offer a definition of a randomized experiment. Definition RE (Randomized Experiment, Imbens and Rubin 2015, p. 40). A randomized experiment is a study in which the assignment mechanism satisfies the following properties: (C) Pr(Z|X, Y(0), Y(1)) is controlled by the researcher and has a known functional form. (P) Pr(Z|X, Y(0), Y(1)) is probabilistic. Several aspects of this definition are relevant for our purposes. First, the word “randomized” in the definition stems from condition (P) (probabilistic assignment), while the word “experiment” stems from condition (C) (researcher’s knowledge and control). The researcher designs and controls the assignment of the treatment, thus creating an experiment or controlled manipulation, and this assignment is not deterministic, in the sense that no unit can rule out ex ante the possibility of being assigned to either one of the conditions. None of the empirical examples introduced above satisfies this definition of a randomized experiment, but for somewhat different reasons. In Titiunik (2016), the assignment mechanism is both probabilistic and known, but it is not under the researcher’s control and thus violates the control part of condition (C). In Lassen (2005), both parts of condition (C) are violated, as the researcher has neither control over the assignment mechanism nor knowledge of its exact functional form. Second, this definition clearly separates the notion of randomization from the notion of “valid” comparison groups or lack of confounders, a distinction that is essential for characterizing natural experiments. Definition RE explicitly allows for the potential outcomes to affect the assignment mechanism, making clear that a probabilistic assignment does not guarantee that treated and control

groups will be comparable, in the sense that it does not guarantee that the treatment is (conditionally) independent of the potential outcomes. Such an unconfoundedness condition must be added as a separate requirement. Formally, the assignment mechanism Pr(Z|X, Y(0), Y(1)) is unconfounded (Imbens and Rubin 2015, p. 38) if it satisfies

Pr(Z|X, Y(0), Y(1)) = Pr(Z|X, Y(0)′, Y(1)′) for all Z, X, Y(0), Y(1), Y(0)′, Y(1)′.    (6.2)

An unconfounded assignment is one in which the probability of each possible treatment allocation vector is not a function of the potential outcomes. This property is violated when, for example, units who have higher potential outcomes are more likely to be assigned to the treatment condition than to the control even after conditioning on the available observable characteristics. In general, any study where units self-select into the treatment based on characteristics unobservable to the researcher that correlate with their potential outcomes constitutes an assignment mechanism that is not unconfounded. When a randomized experiment also satisfies unconfoundedness, Imbens and Rubin (2015) call it an unconfounded randomized experiment. Building on the above definitions, I now state a definition of a randomized controlled experiment. (As I discuss below, this definition is different from Imbens and Rubin’s definition of an unconfounded randomized experiment.) Definition RCE (Randomized Controlled Experiment). A randomized controlled experiment (RCE) is a study in which the assignment mechanism satisfies the following properties: (D) Pr(Z|X, Y(0), Y(1)) is designed and implemented by the researcher. (K) Pr(Z|X, Y(0), Y(1)) is known to the researcher. (P) Pr(Z|X, Y(0), Y(1)) is probabilistic by means of a randomization device whose physical features ensure that Pr(Z|X, Y(0), Y(1)) is unconfounded.
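To see what these abstract conditions require in practice, the toy calculation below enumerates all assignment vectors for three units and contrasts a mechanism whose probabilities ignore the potential outcomes with one whose probabilities depend on them. The mechanisms, the Y(1) profiles, and the brute-force check are illustrative assumptions, not examples taken from the studies discussed in this chapter.

```python
import itertools
import numpy as np

# Toy check of "probabilistic" and "unconfounded" for n = 3 units.
n = 3
assignments = list(itertools.product([0, 1], repeat=n))   # all 2^n vectors z

def unit_probs(mechanism, y1):
    """Unit-level probabilities p_i = sum over {z: z_i = 1} of Pr(z)."""
    weights = np.array([mechanism(z, y1) for z in assignments], dtype=float)
    probs = weights / weights.sum()                        # normalize to sum to one
    return np.array([sum(p for z, p in zip(assignments, probs) if z[i] == 1)
                     for i in range(n)])

def mech_a(z, y1):
    # Complete randomization of exactly one treated unit; the probabilities
    # never look at the potential outcomes, so the mechanism is unconfounded.
    return 1.0 if sum(z) == 1 else 0.0

def mech_b(z, y1):
    # Assignments that treat units with larger Y(1) get more weight; every
    # assignment still has positive probability (probabilistic), but the
    # probabilities depend on Y(1), so the mechanism is confounded.
    return float(np.exp(sum(zi * yi for zi, yi in zip(z, y1))))

for y1 in ([1.0, 1.0, 1.0], [0.0, 0.0, 5.0]):              # two hypothetical Y(1) profiles
    print("Y(1) =", y1,
          "| p_i under A:", np.round(unit_probs(mech_a, y1), 3),
          "| p_i under B:", np.round(unit_probs(mech_b, y1), 3))
```

Mechanism A returns the same unit-level probabilities no matter how the potential outcomes change, while mechanism B shifts probability toward the unit with the large Y(1); both are probabilistic in the sense of Eq. (6.1), but only A satisfies the unconfoundedness condition in Eq. (6.2).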

In a RCE as I have defined it, the assignment mechanism is probabilistic, designed and implemented by the researcher, known to the researcher, and not a function of the potential outcomes (possibly after conditioning on observable characteristics). The latter condition means that the probability that the treatment assignment vector Z is equal to a given z is entirely unrelated to the unit's potential outcomes, possibly after we have conditioned on X. My definition of a RCE is similar to Imbens and Rubin's definition of an unconfounded randomized experiment, with one key difference. In the RCE definition, condition (P) explicitly requires that unconfoundedness be a direct consequence of the type of physical randomization device used to allocate the treatment probabilistically. This explicitly links unconfoundedness to the randomization device. The joint requirements of full control of the design and implementation (D) and knowledge (K) of the assignment mechanism imply that in a RCE the treatment assignment mechanism is fully reproducible. In most cases, full knowledge and reproducibility of the assignment mechanism will be direct consequences of the researcher's being in control of the treatment assignment, and thus condition (K) will be implied by condition (D). However, sometimes knowledge of the mechanism occurs despite the researcher not being in control of the experiment, which is why I separate conditions (D) and (K) in the definition. This occurs in experiments where, just as in a RCE, the treatment assignment is probabilistic and unconfounded by virtue of the use of a physical randomization device, but where the design and implementation of the assignment mechanism are not under the control of the researcher. I shall call this type of experiment a randomized third-party experiment (RTPE).4

4 Others have called this a randomized policy experiment. See, for example, Clayton (2015).

I define it below for completeness. Definition RTPE (Randomized Third-Party Experiment). A randomized third-party experiment (RTPE) is a study in which
the assignment mechanism satisfies the following properties: (D′) Pr(Z|X, Y(0), Y(1)) is designed and controlled by a third party. (K) Pr(Z|X, Y(0), Y(1)) is known or knowable to the researcher. (P) Pr(Z|X, Y(0), Y(1)) is probabilistic by means of a randomization device whose physical features ensure that Pr(Z|X, Y(0), Y(1)) is unconfounded. The inclusion of the word "knowable" in the RTPE definition is meant to encompass studies where the probabilities of treatment assignment are not necessarily known explicitly, but can be recovered based on other features of the assignment. An insightful example is given by Abdulkadiroğlu et al. (2017), who study a centralized assignment mechanism, known as deferred acceptance (DA), that matches students to schools of their preference. The authors show that any DA mechanism that satisfies the equal treatment of equals condition – students who have the same preferences and priorities about all schools are assigned to each school with the same probability – results in a mapping from preferences, priorities, and school characteristics into a conditional probability of random assignment that can be recovered via simulation (and also analytically under some additional assumptions). Although this example is not a "conventional" randomized experiment where treatment assignment probabilities are known ex ante, the DA algorithm gives the experimenter enough knowledge about the assignment mechanism so that the probabilities of treatment assignment can be deduced ex post and then conditioned on to obtain the conditional independence between treatment and potential outcomes that would have held if the school district had explicitly used those probabilities to assign students via a lottery. Given the definition above, this study is a RTPE despite the lack of explicit probabilities.

6.3.2 Three Senses of the Word "Random"

Some of the ambiguity regarding natural experiments has stemmed from the failure
to properly distinguish randomness from unconfoundedness and mistakenly assuming that randomness in the assignment mechanism automatically guarantees treatment and control groups that are comparable in all relevant respects. At least part of the ambiguity seems to stem from the different senses of the word “random” that are used sometimes interchangeably to describe both natural experiments and RCEs. I now discuss different meanings of “random” and their relationship to unconfoundedness, relying on a related distinction between externality and exogeneity introduced by Deaton (2010) in his critique of the use of natural experiments as sources of instrumental variables (IVs). Following a terminology first adopted by Heckman, Deaton distinguishes between an instrument being “external” to refer to variables that are determined outside the system, and “exogenous” to refer to the orthogonality condition that is needed for consistent estimation of the parameter of interest in an IV context. My focus in this chapter is on studies where interest lies directly on the effect of Z on Y and not on its effect via another variable, so I ignore concerns about the exclusion restriction. However, I will show that even in this simpler case the distinction between the externality of Z and the type of “randomization” that such externality creates is essential to understanding the ways in which natural experiments differ from RCEs. It is well known that RCEs can violate the IV exclusion restriction (Angrist et al. 1996). Thus, in IV settings, natural experiments and RCEs are on a more equal footing, in the sense that neither can guarantee the identifiability of the treatment effect of interest. When it comes to the “reduced form” effect of Z on Y , however, natural experiments face unique challenges that are absent in RCEs. My interest in this section is to discuss these particular challenges, and for this reason I focus on the effect of Z on Y . However, my discussion also applies to IV settings, because the challenges faced by natural experiments in identifying the reduced form effect remain when natural experiments

are used as a "source of instruments" (Angrist and Krueger 2001, p. 73). I consider three different uses of the term "random," all of which have been used to characterize natural experiments – though I do not mean to imply that these are the only three ways in which the term "random" has been used in the history of science. The first is what I call the colloquial definition of random; this is the first sense listed by the Merriam-Webster dictionary, which defines the adjective random as "lacking a definite plan, purpose, or pattern," and further clarifies that this use "stresses lack of definite aim, fixed goal, or regular procedure." Used in this sense, a random treatment assignment refers to an assignment mechanism that follows an arbitrary, inscrutable plan that has no clear pattern. The notion of inscrutability is similar to the concept of Knightian uncertainty in economics. In his seminal study, Knight (1921) used the term risk to refer to the kind of uncertainty that is measurable and quantifiable with objective probabilities, and reserved the term uncertainty to refer to situations where the randomness cannot be objectively quantified and thus cannot be insured in the market. A similar distinction was advanced by Keynes (1921); see the discussion in LeRoy and Singell Jr. (1987). The second meaning of the word "random" is the one most likely to be found in statistics textbooks. This sense of random, which I call the statistical definition of random, refers to situations in which we have uncertainty about what event will occur, but we can precisely characterize all possible events that may occur and exactly quantify the probability with which each event will occur (analogous to Knightian risk). In this sense, a random treatment assignment is an assignment of units to treatment and control conditions in which the uncertainty can be completely and exactly quantified via the function Pr(Z|X, Y(0), Y(1)), which specifies the probability of the occurrence of each possible treatment allocation. Used in the statistical sense, randomization thus refers to "the selection of an element a, from a set A,
according to some probability distribution P on A” (Berger 1990). In his treatise on experimental design, Fisher (1935) explicitly rejects the colloquial sense of random in his definition of a randomized experiment. While discussing an agricultural experiment that assigns land plots to various crops to test the relative yield of each crop variety, Fisher is explicit in ruling out haphazardness or arbitrariness: In each block, the five plots are assigned one to each of the five varieties under the test, and this assignment is made at random. This does not mean that the experimenter writes down the names of the varieties, or letters standing for them, in any order that may occur to him, but that he carries out a physical experimental process of randomisation, using means which shall ensure that each variety has an equal chance of being tested on any particular plot of ground. (Fisher, 1935, p. 51)

The above passage suggests yet a third sense of random, which is in fact a particular case of the statistical definition. This third definition equates randomness with a situation in which all possible outcomes are equally likely. This is the sense used by Fisher in the passage above, and even more explicitly described in Fisher (1956) when he discusses random throws of a die: … we may think of a particular throw, or of a succession of throws, as a random sample from the aggregate, which is in this sense subjectively homogeneous and without recognizable stratification. (Fisher, 1956, p. 35)

When used in this third sense, a random assignment mechanism refers to a mechanism that gives every single possible arrangement of treated and control units the same probability of occurrence. For example, if an assignment mechanism allocates exactly nt units to treatment and n − nt units to control, it is random in this sense if Pr(Z = z|X, Y(0), Y(1)) = 1/(n choose nt) for every such z, where (n choose nt) is the number of ways of choosing which nt of the n units are treated. I call this the equiprobable sense of random. In sum, the three senses of random refer to three different kinds of uncertainty. The colloquial sense means uncertainty that is
arbitrary and inscrutable, not amenable to characterization by a clear pattern. The statistical sense of random refers to uncertainty that can be precisely characterized by a known probability distribution. And the equiprobable sense of random is a particular case of the statistical sense and refers to uncertainty that is characterized by a known probability distribution that assigns equal probability to each possible outcome. The ambiguous and overlapping usage of the term "random" is why defining a natural experiment as having an "as-if" random treatment assignment lacks statistical rigor. If random is used in the colloquial sense, then the "as-if" qualifier is not needed and distorts meaning, as random in the colloquial sense already refers to an arbitrary/inscrutable assignment. If random is used in the statistical sense, the "as-if" qualifier is simply incorrect. The assignment vector Z is a random variable, and as such it has some distribution over the sample space of assignments. Used in the statistical sense, a natural experiment has a real random assignment, not an "as-if" random assignment. Finally, used in the equiprobable sense, a natural experiment is typically not random at all: earthquakes are more likely to destroy huts than concrete buildings, rain on election day is more likely in Seattle than in Arizona, and abortion restriction laws are more likely to be passed in socially conservative than in socially liberal constituencies.

6.3.3 Random Assignment Does Not Imply Probabilistic Assignment

An assignment mechanism that is random in either the statistical or the equiprobable sense need not be probabilistic in the sense of Eq. (6.1). For example, neither the statistical nor the equiprobable definition of random rules out a treatment assignment mechanism in which all units are assigned to treatment with probability one. This point is trivially true – a constant is a special case of a random variable in which all the probability mass is accumulated at a single value – but it matters for our purposes. Of course, since random assignment is usually discussed in the
context of evaluating the effects of receiving a treatment relative to not receiving it, the existence of a comparison group in this context is presupposed. This is why Fisher does not explicitly include “probabilistic” in his definition of random, but it is clear that he does so implicitly. Informally, the requirement that the assignment be probabilistic is essential if our purpose is to obtain comparable treated and control units; otherwise, the treatment assignment may be perfectly correlated with confounders.5 The colloquial sense of random does rule out the particular deterministic assignment that assigns every unit to treatment (or to control), since in this case a very clear pattern of assignment would be discernible. However, other forms of nonprobabilistic assignments are still compatible with the colloquial notion of randomness. For example, Fisher’s farmer could decide that plots on the edge of the property line will always be assigned to the same crop. This decision would be entirely arbitrary, thus satisfying the colloquial definition of random. Moreover, to the external observer, this nonprobabilistic assignment would be hard to catch, unless he or she happens to measure the proportion of boundary plots in treatment versus control groups. This point turns out to be important: in natural experiments, since the assignment mechanism is unknown to the researcher, he or she will not be able to distinguish probabilistic from deterministic assignments, because the assignment could be deterministic conditional on a characteristic that is unobserved to the researcher – which would misleadingly give the appearance of a probabilistic assignment. A probabilistic assignment is therefore an assignment that is random in the statistical sense, with the added restriction that the probability distribution that characterizes 5 See Heckman et al. (1998) for a formal characterization of the bias introduced by violations of a probabilistic assignment in the context of selection on observables, which also applies immediately to stratified randomized experiments. If the assignment is deterministic for units with certain characteristics X = x, this introduces a lack of common support that impedes obtaining valid causal effects even if the assignment is unconfounded.

the randomness not assign extreme (i.e., zero or one) individual probabilities.

6.3.4 Random Assignment Does Not Imply Unconfoundedness

In a randomized experiment as stated in Definition RE, no unit has perfect control over which treatment it receives, in the sense that all units have a positive ex ante probability of being assigned to both the treated and control conditions. The assignment is therefore random in the statistical sense, governed by Pr(Z|X, Y(0), Y(1)). However, a probabilistic assignment does not imply an unconfounded assignment. This point is easy to see in terms of our decentralization example. Imagine that in the Lassen (2005) study some city districts have high crime and reducing crime is the top priority of government administrators. Imagine also that decentralization gives districts more precise tools to combat and reduce crime. To say that the assignment is probabilistic or "randomized" is to say, for example, that districts lack the ability to perfectly and precisely self-select into the decentralization treatment that they believe will result in the most effective crime reduction. But this does not mean that high-crime districts have the same probability of being decentralized as low-crime districts. Perhaps officials from high-crime areas forcefully express their strong preference for decentralization to city administrators, and this results in their having a larger probability of receiving the treatment than low-crime areas. A probabilistic assignment only means that this probability is not one (nor zero); it does not mean that different types of units have the same probability of receiving treatment. If assignments with decentralized high-crime areas are more likely than assignments with decentralized low-crime areas, a naive, unadjusted comparison of treated versus control outcomes will not yield a consistent estimate of the average effect of decentralization. A valid comparison requires that we reweight or stratify the observations based on the different probabilities of receiving treatment, something that is easy
to do if we know the exact functional form of Pr(Z|X, Y(0), Y(1)), but entirely unfeasible if this assignment mechanism is unknown and unknowable. One way to think of a confounded assignment is as a blocked randomized experiment in which different "types" of individuals defined by potential outcomes are assigned to treatment with different probabilities. For example, imagine that all units have the same potential outcome under control, Yi(0) = y0 for all i. Defining high types as units with Yi(1) − y0 > 0 and low types as units with Yi(1) − y0 ≤ 0, we can conceive of a randomized experiment that violates unconfoundedness as a blocked randomized experiment where high types are assigned to treatment with higher probability than low types and types are unobservable to the researcher. It is well known that the proper analysis of a stratified randomized experiment with treatment assignment probabilities that vary by strata or blocks requires accounting for the different strata, which in turn requires knowing the strata to which every unit belongs (see, e.g., Athey and Imbens 2017; Gerber and Green 2012; Imbens and Rubin 2015). In this example, failing to account for the different strata would overestimate the true average treatment effect. In general, obtaining valid conclusions from an unconfounded block-randomized experiment is not feasible when the strata remain hidden from the researcher. In other words, chance does not imply comparability. Finally, note that randomness in the equiprobable sense does imply unconfoundedness. An assignment mechanism that is equiprobable is also unconfounded: any assignment that gives each vector z the same probability of being chosen is by construction attaching a constant probability to each z, which as a consequence cannot be a function of the potential outcomes. But the converse is not true: an unconfounded assignment mechanism does not imply that each possible treatment assignment vector z must be equally likely. For example, a mechanism that uses a random device to allocate two-thirds of women and one-third of men to treatment is unconfounded, but
it is not random in the equiprobable sense when all units are considered as a whole (though it is equiprobable within gender blocks). An equiprobable random assignment mechanism is perhaps the simplest way to ensure an unconfounded assignment, which may be why the term “random assignment” is often used as a synonym for unconfoundedness.
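The hidden-strata example just described can be made concrete with a short simulation. The sketch below assigns "high types" to treatment with a higher probability than "low types," hides the types from the analyst, and compares the naive difference in means with the stratified estimate that would be feasible if the types were observed. All numerical values (stratum shares, probabilities, effect sizes) are illustrative assumptions rather than quantities from the studies discussed in this chapter.

```python
import numpy as np

# Blocked assignment with hidden strata: high types are treated with
# probability 0.8, low types with probability 0.2.
rng = np.random.default_rng(1)
n = 200_000
high = rng.random(n) < 0.5              # half the units are high types (unobserved)
y0 = np.zeros(n)                        # common control outcome y0 = 0
y1 = np.where(high, 2.0, 0.0)           # treatment helps only high types
true_ate = (y1 - y0).mean()             # equals 1.0 by construction

p = np.where(high, 0.8, 0.2)            # type-dependent assignment probabilities
z = rng.random(n) < p
y = np.where(z, y1, y0)

naive = y[z].mean() - y[~z].mean()      # ignores the hidden strata
stratified = sum((y[z & s].mean() - y[~z & s].mean()) * s.mean()
                 for s in (high, ~high))  # feasible only if strata are observed
print(f"true ATE {true_ate:.2f} | naive {naive:.2f} | stratified {stratified:.2f}")
```

The naive comparison overstates the average effect because treated units are disproportionately high types, while the stratified estimate recovers it; when the strata cannot be observed, the latter calculation is simply unavailable, which is the sense in which chance alone does not deliver comparability.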

6.3.5 Physical Devices or Procedures That Ensure Unconfoundedness

Because a probabilistic assignment mechanism does not imply that the mechanism is unconfounded, it follows that the superior credibility of RCEs does not stem exclusively from random chance. Although chance or uncertainty is needed to ensure condition (P) in the RCE definition, chance alone is not enough to bestow an experiment with the ability to identify causal effects. Somewhat counterintuitively, part of the power of randomized experimentation lies not in the creation of uncertainty, but rather in the use of physical randomization devices or procedures that are capable of assigning treatments without being influenced by the units' potential outcomes. By a physical randomization device or procedure I mean a set of rules that allows the researcher to assign the treatment according to a known or knowable probability distribution function.6

6 These rules could rely on assignment probabilities implicitly rather than explicitly, as in Abdulkadiroğlu et al. (2017). The key requirement is that the researcher be able to fully recover and reproduce the induced probabilities of treatment assignment, even if those probabilities were not explicitly used to assign units.

These procedures are more than the means by which the end of probabilistic assignment is achieved; they guarantee that chance is introduced in a way that ensures identification of causal effects and the quantification of uncertainty. Without a physical randomization device that ensures knowledge of the probability distribution of the assignment mechanism,
chance is not necessarily helpful. To see this, consider the following strategies to introduce a probabilistic treatment assignment. We could stack paper applications on a desk and blow a fan at them and then assign to treatment the applications that fall to the floor. Or we could have an octopus select applications,7 or let Fisher’s farmer choose the applications “in any order that may occur to him.” All of these strategies would be random in the colloquial sense. It might also be plausible to assume that all of these strategies would lead to a probabilistic assignment in the sense that, a priori, all applications would have a nonzero chance of being selected for treatment and for control – though this may be difficult to verify. However, it would be premature to claim that the assignment is unconfounded. For example, if the original pile of applications on the table were sorted alphabetically with Z at the bottom and A at the top and the wind was more likely to blow away top applications, we would have more names in the A–L part of the alphabet assigned to treatment than to control. Since, for example, ethnicity often correlates strongly with last name, our treatment and control groups would very likely differ on ethnicity, and as a result on any other observable and unobservable characteristics that may correlate with it, such as immigration status, political orientation, neighborhood of residence, etc. Examples of physical randomization procedures are varied. Fisher (1935) describes a device based on a deck of cards for an agricultural experiment in which five plots of land are to be assigned randomly to five fertilizer varieties. Cards are numbered from 1 to 100 and repeatedly shuffled so that they are arranged in random order; the five treatments are numbered 1–5; and the experimenter draws one card for every plot. The fertilizer assigned to the plot is the remainder obtained when the number on the drawn card is divided by 5 if the number is not a multiple of 5; if it is, the plot is assigned to fertilizer 5. This procedure guarantees that 7 See the case of Paul the psychic of Oberhausen (e.g., www.bbc.com/news/10420131).

each fertilizer variety corresponds to 20 cards; since there are 100 cards, the probability that each plot is assigned to each of the fertilizer varieties is 1/5. Another randomization device is a rotating lottery drum where the researcher deposits balls or tickets containing numbers representing each of the experimental units. The balls or tickets are drawn after rotating the drum, ensuring that at any point each of the remaining balls has the same probability of being selected. This procedure was used, for example, to assign each one of the integers between 1 and 366 to each one of the possible birth dates in a year (including February 29) to select who would be drafted to the Vietnam War, the numbers 1–366 indicating the order in which men would be drafted. (The Vietnam lottery, however, seems to have failed to produce equally likely outcomes; see discussion below.) In scientific studies conducted today, the most common mechanism to allocate treatments randomly is based on computer-generated pseudorandom numbers. The principles underlying the generation of pseudorandom numbers offer important lessons for our discussion. Pseudorandom numbers can be generated in multiple ways, but all of them share the characteristic of being entirely predictable, directly ruling out the colloquial definition of random. For example, the Lehmer linear congruential algorithm (Lehmer 1951; Park and Miller 1988) requires the choice of a prime modulus m, an integer a ∈ {2, 3, . . . , m − 1}, and an initial value x0. The first value is generated as x1 = a x0 mod m, the remainder when a x0 is divided by m, and all subsequent values are generated as xi+1 = a xi mod m. Given the initial value x0, the entire sequence is completely determined, which illustrates the fundamental distinction between the colloquial and the statistical definitions of random, lucidly summarized by Park and Miller:

Over the years many programmers have unwittingly demonstrated that it is all too easy to "hack" a procedure that will produce a strange looking, apparently unpredictable
sequence of numbers. It is fundamentally more difficult, however, to write quality software which produces what is really desired – a virtually infinite sequence of statistically independent random numbers, uniformly distributed between 0 and 1. This is a key point: strange and unpredictable is not necessarily random. (Park and Miller 1988, p. 1193)
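As a concrete reference point, a minimal implementation of the Lehmer scheme just described is sketched below, using the "minimal standard" constants recommended by Park and Miller (1988): the prime modulus m = 2^31 − 1 and the multiplier a = 16807. The seed value and the number of draws are arbitrary illustrative choices.

```python
# Minimal sketch of the Lehmer linear congruential generator with the
# Park-Miller "minimal standard" constants: m = 2**31 - 1 (prime), a = 16807.
M = 2**31 - 1
A = 16807

def lehmer(seed, size):
    """Return `size` pseudorandom draws in (0, 1) from x_{i+1} = a * x_i mod m."""
    x = seed
    out = []
    for _ in range(size):
        x = (A * x) % M
        out.append(x / M)        # scale the integer state to the unit interval
    return out

print(lehmer(seed=20210601, size=5))
# Re-running with the same seed reproduces the identical sequence: the output
# is entirely predictable given the seed, yet it behaves, statistically, like
# a stream of independent draws that are approximately uniform on (0, 1).
```

The point of the example is exactly the distinction drawn in the text: nothing about the sequence is haphazard or inscrutable, and it is precisely this full knowledge and reproducibility of the mechanism that makes it usable as a basis for inference.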

Formally demonstrating that randomization devices do in fact produce an equidistributed sequence of numbers is difficult, both because the physical properties of certain randomization devices can be complex (e.g., Aldous and Diaconis 1986) and because demonstrating (and even defining) the randomness of a sequence is a hard mathematical problem (see, e.g., Downey and Hirschfeldt 2010; Pincus and Kalman 1997; Pincus and Singer 1996). Nonetheless, with our current knowledge of mathematics and algorithmic randomness, several randomization devices such as pseudorandom number generators or sufficiently shuffled cards are in fact able to produce independent, uniformly distributed numbers. I refer to such devices as proper randomization devices to distinguish them from randomization devices that appear to, but ultimately fail to, produce equidistributed sequences. The feature that proper randomization devices have in common is that: (1) the allocation of units to the treatment/control conditions that they produce is entirely determined by their physical and statistical properties, which are by construction unrelated to the units' potential outcomes and thus result in an unconfounded assignment mechanism; and (2) these properties are known and well understood, which in turn implies that the assignment mechanism is entirely known or knowable and thus reproducible. Thus, proper physical randomization devices not only ensure that there is an element of chance regarding which units receive treatment, but also, by their very properties, they simultaneously guarantee that the assignment mechanism is unconfounded. The use of a proper physical randomization device is as fundamental in its role to ensure unconfoundedness as in its role to ensure random chance. We could
introduce chance in treatment assignment using fans, octopus, or earthquakes. But only a fully known and reproducible physical randomization procedure guarantees the type of randomness that can be used as the basis for inference and identification. This guarantee, however, is not bulletproof. There are numerous and notable examples where the physical properties of randomization devices failed to produce unconfounded assignments because they were mistakenly believed to be proper devices. For example, the implementation of the 1970 Vietnam lottery is believed to have been defective (the capsules not sufficiently mixed), assigning systematically lower numbers to birth dates in later months, contrary to the uniform distribution that the lottery drum was supposed to produce (see, e.g., Fienberg 1971). This is a case where the physical properties of the device were mistakenly believed to produce an equiprobable assignment. For another example, see the Lanarkshire milk experiment (Student 1931). Such “failures of randomization” can invalidate a RCE or RTPE, unless the true probabilities induced by the defective randomization device can be learned or discovered. However, note that in the case of the Vietnam lottery, researchers were able to detect the departure from an equiprobable assignment precisely because they believed that the physical randomization device guaranteed such an assignment and because an equiprobable assignment has objective empirical implications (similar number of observations per birth month, etc.). This ability to detect departures from a known randomization distribution is only possible when such a distribution can be specified ex ante. It is precisely because we believe that the Vietnam lottery drums should have produced a uniform assignment that we discover that something must have been wrong with the device (or with our beliefs about the device). In contrast, in natural experiments, because we fundamentally ignore the distribution of the external assignment mechanism, we have no way of using the observed assignment to validate our beliefs

about the physical randomization device used by nature, at least not in the absence of additional assumptions.
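The detection logic described for the 1970 draft lottery can be illustrated with simulated data: when the intended distribution is known ex ante, its empirical implications can be checked. The sketch below simulates a poorly mixed drum in which late-year birth dates are more likely to be drawn early and then tests the monthly counts of low draft numbers against the uniform benchmark. The weighting scheme, the seed, and the use of a chi-square goodness-of-fit test are illustrative assumptions; the data are simulated, not the actual 1970 draws.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
days_per_month = np.array([31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
month_of_date = np.repeat(np.arange(12), days_per_month)       # 366 birth dates

# Defective mixing: a date's chance of being among the first 100 numbers
# drawn rises with its month (an assumed form of under-mixing).
weights = 1.0 + 0.5 * month_of_date
early_dates = rng.choice(366, size=100, replace=False, p=weights / weights.sum())

observed = np.bincount(month_of_date[early_dates], minlength=12)
expected = days_per_month / 366 * 100                          # equiprobable benchmark
chi2, pval = stats.chisquare(observed, f_exp=expected)
print("low numbers per month:", observed)
print(f"chi-square = {chi2:.1f}, p-value = {pval:.4f}")
```

A clear concentration of low numbers in late months, and a correspondingly small p-value, would flag a departure from the equiprobable assignment that a properly mixed drum was supposed to deliver; without that known benchmark, the same pattern could not be distinguished from an assignment that was never meant to be uniform in the first place.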

6.4 An Alternative Definition of Natural Experiment

The key feature of the as-if random interpretation of a natural experiment is the existence of an external factor or phenomenon that governs the allocation of treatment among units. This external phenomenon is most commonly ruled by the laws of nature (earthquakes, hurricanes, etc.) or the laws of government (minimum age restrictions, voting rules, etc.) and results in a treatment assignment that has been variously described as haphazard (Rosenbaum 2002), as-if random (Dunning 2008), naturally occurring (Rutter 2007), not according to any particular order (Gould et al. 2004), serendipitous (DiNardo 2016; Rosenzweig and Wolpin 2000), unanticipated (Carbone et al. 2006), unpredictable (Dunning 2012), unplanned (Lalive and Zweimüller 2009), quasi-random (Fuchs-Schündeln and Hassan 2016), or a shock (Miguel et al. 2004). As discussed above, an arbitrary and unpredictable treatment assignment implies neither unconfoundedness nor knowledge (and thus reproducibility) of the assignment mechanism – two distinctive features of RCEs and RTPEs. For this reason, I introduce a definition of a natural experiment that preserves the externality of the treatment assignment mechanism but, in contrast to prior interpretations, emphasizes its nonexperimental qualities rather than its "as-if randomness." In my definition, the external phenomenon that governs treatment assignment ensures (in successful cases) that the assignment mechanism is probabilistic, but not that it is unconfounded. I first distinguish RCEs from observational studies and then define a natural experiment as a particular case of an observational study. For this, I consider two dimensions.

Table 6.1 Typology of randomized experiments and observational studies.

                                                    Probabilities known or knowable to researcher
                                                    Yes                                          No
Designed and implemented by the researcher   Yes    Randomized controlled experiment (RCE)       Observational study
                                             No     Randomized third-party experiment (RTPE)     Observational study

The first is whether the researcher is in control of the design and implementation of the experiment. The second dimension is whether the probabilities associated with each possible treatment allocation are known (or knowable). The four possible combinations of these two criteria are illustrated in Table 6.1. Given a probabilistic treatment assignment, the difference between a randomized experiment and a nonexperimental design depends crucially on both knowledge and control of the assignment mechanism. When a researcher controls the design and implementation of a probabilistic treatment assignment, he or she has full knowledge of all of the probabilities associated with each possible treatment allocation. As a consequence, the randomization procedure is fully known and reproducible. This combination, represented by the top-left corner of Table 6.1, corresponds to RCEs as defined above. The rows of the table correspond exactly with condition (D) in the definition of a RCE. Condition (P) in the definition is satisfied implicitly if we assume that when a researcher designs and controls the assignment, he or she chooses a probabilistic assignment.8 And the unconfoundedness assumption (U) is implied by the assumption that the treatment allocation probabilities are fully known.9

8 If the assignment has pi ∈ {0, 1} for some units, the population of interest could be redefined to only include those units whose probabilities are neither zero nor one.

9 For example, if a researcher uses a higher probability of treatment assignment for patients who are known to benefit the most from treatment, this would appear to violate the unconfoundedness assumption. However, since we are assuming that the researcher designed the experiment, he or she would know and be able to reproduce all treatment assignment probabilities for every unit, thus making the high/low potential benefit strata fully observable, which would restore unconfoundedness (conditional on potential benefit). Even if all units are assigned to treatment with a different probability and there are no strata, knowing these probabilities is sufficient to consistently estimate the average treatment effect and perform exact Fisherian inference based on the sharp null hypothesis. As long as all probabilities are fully
Being in charge of the design and implementation of the randomized experiment, however, is a not a necessary condition to having full knowledge of the assignment mechanism. Researchers often discover randomized experiments that are designed and implemented by third parties such as policymakers. In some cases, the third party is willing to disclose all details regarding the assignment mechanism, and as a consequence all probabilities of treatment assignment become known to the researcher despite him or her not being in direct control of the experiment. In Definition RTPE, I called this a randomized third-party experiment. RTPEs belong in the bottom-left cell of Table 6.1, where probabilities are known or knowable but the experiment is either not designed or not implemented by the researcher, or possibly both. If the treatment assignment mechanism is known, then even when the researcher is not in control of the assignment as in Titiunik (2016), welldefined treatment effects are identifiable, inference methods for the analysis of randomized experiments are ensured to be valid, and assumptions are falsifiable. Regardless of who designs and implements the experiment, if the probabilities associated with each possible treatment allocation are unknown to the researcher, the design is nonexperimental – also known as an “observational study.” My definition of an observational study follows Imbens and Rubin (2015), who define it as a study in which “the functional form of the assignment mechanism is unknown” (p. 41). In contrast to a RTPE, where the lack of direct design or implementation is accompanied by knowledge of the probability of occurrence of each treatment allocation, in an observational known, the possibility of violating unconfoundedness does not arise or is inconsequential.

study the researcher fundamentally ignores or has no access to these probabilities. In practice, cases that belong to the topright cell of Table 6.1 are rare because a randomized experiment that is designed and implemented by the researcher typically implies that the treatment assignment mechanism is fully known to the researcher. However, there might be cases where the researcher controls the treatment assignment, but either the design or the implementation is faulty and as a consequence the exact treatment allocation probabilities are unknown – examples include the Vietnam lottery and the Lanarkshire milk experiment mentioned above. Given the above distinctions, I now introduce a new definition of natural experiment. Definition NE (Natural Experiment). A natural experiment is a study in which the assignment mechanism satisfies the following properties:  Pr(Z|X, Y(0), Y(1)) is neither designed (D) nor implemented by the researcher.  Pr(Z|X, Y(0), Y(1)) is unknown and (K) unknowable to the researcher. ( P) Pr(Z|X, Y(0), Y(1)) is probabilistic by virtue of an external event or intervention that is outside the experimental units’ direct control. This definition is intentionally analogous to my prior definitions of a RCE and a RTPE to facilitate a comparison. A natural experiment is a research design where the researcher is in charge of neither the design of the treatment assignment mechanism nor  Moreover, its implementation (condition D). the treatment assignment mechanism is  unknown and unknowable (condition K), which means that the researcher does not know and has no way of knowing the probabilities associated with each possible treatment allocation. The latter condition – assignment mechanism unknowable – immediately implies that a natural experiment is an observational study. The third and last condition in the definition ( P) captures what has often been invoked as the main feature of a natural experiment:

its unpredictability as a result of the assignment mechanism’s dependence on an external factor. A natural experiment is a special kind of observational study where the mechanism that allocates treatment is known to depend on an external factor. In my definition, this external factor is assumed to be the source of randomness that results in a probabilistic assignment mechanism and thus captures the unpredictable component that has been emphasized in prior characterizations of natural experiments. Note that condition  P is not directly verifiable or falsifiable. Although the existence of the external factor will typically be immediately verifiable, verifying that this external factor resulted in a probabilistic assignment will be considerably more difficult and often impossible. Thus, classifying an observational study as a natural experiment will require assuming that the external forces of nature that intervened in the assignment of treatment did so in such a way as to produce a probabilistic assignment. The justification of this assumption will often rest on the argument that the experimental units have no ability to directly control the external factor and thus have no ability to choose their treatment condition deterministically. This is a heuristic rather than a formal argument, as the units’ lack of control of their own assignment is not by itself sufficient to ensure a probabilistic assignment – rather, the lack of control introduced by the external factor is simply used as the basis for assuming that the assignment was governed, at least partly, by chance. In a standard observational study, it is often impossible for the researcher to know which, if any, of the units that actually took the treatment were ex ante at risk of not taking it. In contrast, in a natural experiment, there is an external factor that serves as the basis for making such an assumption. Although the probability of receiving treatment is still possibly a function of potential outcomes, it is also affected by an external factor over which the units have no precise control. For example, even though families can choose to invest in more durable construction materials to protect against earthquakes or floods, the

Natural Experiments

severity of natural disasters is not under any family’s control, and thus it is impossible for a family to precisely and perfectly guarantee that their house will not be destroyed by a natural disaster, which introduces an element of chance as to which houses are in fact destroyed. The distinction is similar to that introduced by Lee (2008) between “systematic or predictable components that can depend on individuals’ attributes and/or actions” and a “random chance component” that is uncontrollable from the point of view of the unit (Lee 2008, p. 681). Crucially, the externality is not absolute, but relative to the units who are receiving the treatment. This external factor implies that the units lack precise control over the treatment condition they will receive, and thus that the treatment assignment mechanism is not fully under the control of the units who are the subjects of the study. Thus, external means “external to units,” not necessarily to other actors. For example, in the Lassen (2005) study, the assignment of districts to the decentralization condition depended on various factors. Some of those factors are units’ characteristics such as population size and suburban status. These are examples of characteristics X that may be correlated with the units’ potential outcomes and determined before the treatment is assigned. But Lassen’s account of the decision process that governed the decentralization policy suggests that, despite their different characteristics, all of the districts in the sample were at risk of being assigned to the decentralization group. The assumption of probabilistic assignment is supported by the policymakers’ account of how the decentralization policy was carried out. ˜ holds However, even if condition (P) and the assignment is in fact probabilistic by virtue of the external factor, there remains a crucial obstacle. The central distinction between a RCE or RTPE and a natural experiment as I have defined it is that, in a natural experiment, the exact probabilities with which each possible treatment allocation could have occurred are fundamentally unknown. Thus, even if

119

the external factor prevents the experimental units from having precise control over which treatment condition they receive, the researcher has fundamental uncertainty about the actual probabilities associated with each allocation. Thus, a research design that satisfies Definition NE is still insufficient to identify or make inferences about causal effects, and researchers need to invoke additional assumptions. I elaborate on this issue in the following two sections, after discussing the particular case of the regression discontinuity design. 6.4.1 Is the Regression Discontinuity Design a Natural Experiment? I now discuss whether Definition NE applies to the regression discontinuity (RD) design, a research design that has become widely used throughout the social and behavioral sciences (for overviews, see Cattaneo et al. 2020a, 2020c). Part of the popularity of the RD design stems from the idea that the RD treatment assignment resembles the assignment in RCEs, and thus its credibility is similar to the credibility of an actual experiment. The notion of “as-if random” or “akin to random” appears frequently in discussions of RD designs, which suggests that any general discussion surrounding natural experiments should apply to RD designs in particular. A RD design is a study in which all units receive a score (also known as a running variable), and a treatment is allocated according to a specific rule that depends on the unit’s score and a known cutoff. In the simplest, binary treatment case, the rule assigns the treatment condition to units whose score is above the cutoff and assigns the control condition to units whose score is below it. Letting Ri be the score for units i = 1, 2, . . . , n and r0 be the cutoff, each unit’s treatment assignment is Ti = 1(Ri ≥ r0 ). This rule implies that, conditional on R, the treatment assignment is deterministic, since P(Ti = 1| Ri ≥ r0 ) = 1 and P(Ti = 1|Ri < r0 ) = 0. All RD designs rely on this discontinuous change in the probability of treatment assignment to study the effect of the

treatment at the cutoff, under the assumption that this probability is the only relevant feature of the data-generating process that changes discontinuously at the cutoff – or, more precisely, under the assumption that the distribution (or expectation) of the units’ potential outcomes is continuous at the cutoff. A canonical RD example, first introduced by Lee (2008), is one in which the treatment of interest is winning an election, and the score is the vote share obtained by a political party. Under plurality rules with only two candidates, the party wins the election if it obtains 50% of the vote or more and it loses otherwise. Although districts where the party wins will not in general be comparable to districts where the party loses, one interpretation of the RD design posits that in districts where the election is very close, chance plays a role in deciding the ultimate winner. Some scholars have claimed that the RD treatment assignment rule induces variation in the treatment assignment that is as good as the variation induced by a RCE, elevating RD designs above most other observational studies. The analogy between RD designs and randomized experiments has been invoked frequently to justify the classification of the RD design as an almost-experiment and its treatment assignment as “as-if random.” DiNardo (2016, p. 7) observes that “if we focus our attention on the difference in outcomes between ‘near winners’ and ‘near losers’ such a contrast is formally equivalent to a randomized controlled trial if there is at least some ‘random’ component to the vote share.” Lee (2008, p. 676) argues that “causal inferences from RD designs can sometimes be as credible as those drawn from a randomized experiment,” while Lee and Lemieux (2010) call RD designs the “close cousins” of randomized experiments. These analogies between RD designs and randomized experiments are based on the role of unpredictability in the final treatment assignment. Dunning (2012) sees unpredictability as the source of comparability, asserting that “given the role of unpredictability and luck in exam performance, students just above and below

the key threshold should be very similar, on average.” Lee (2008) also views uncertainty as the source of comparability, asserting that “Even on the day of an election, there is inherent uncertainty about the precise and final vote count. In light of this uncertainty, the local independence result predicts that the districts where a party’s candidate just barely won an election ... are likely to be comparable in all other ways to districts where the party’s candidate just barely lost the election” (Lee 2008, pp. 676–677). The RD design fits the definition of a natural experiment that I introduced above. Its assignment mechanism is typically neither designed nor controlled by the researcher. Moreover, although it seems that the RD treatment rule T = 1(R ≥ r0) makes the assignment mechanism fully known, it is only known conditional on R. Given a unit’s score value, the researcher knows whether the probability of being assigned to treatment was zero or one. However, the researcher fundamentally does not know the probability distribution of the score R, which implies that, in any window around the cutoff, certain types of individuals could have been more likely than others to receive a score above the cutoff. If types correlate with potential outcomes, then units barely above and barely below the cutoff will not be comparable unless we condition on type. Sekhon and Titiunik (2017) discuss this point at length and show that random assignment of the RD score in a neighborhood of the cutoff does not imply that the potential outcomes and the treatment are statistically independent, nor that the potential outcomes are unrelated to the score in this neighborhood. This distinction is analogous to the distinction between probabilistic and unconfounded assignment. The element of chance contained in the ultimate value of the score that a unit receives implies that the assignment mechanism is probabilistic. Consider a RD design where a scholarship is given to students whose grade in an exam is above a known threshold. Even good students can see their exam performance adversely affected by ambient noise, unexpected illnesses, or unreasonably hard questions.

This means that there is an element of chance in the ultimate grade that any student receives. This element of chance, in combination with the RD rule, implies that a student’s placement above or below the cutoff is a random variable. Its probability distribution, however, is fundamentally unknown to the researcher. Observing the scores assigned to the units in a RD design is analogous to observing the treatment status of each unit in an experiment where the probability of treatment assignment of each unit is hidden from or unknown to the researcher. This means that if we adopt a local randomization approach to RD designs (Cattaneo et al. 2015, 2017, 2020a, 2020b), where we focus on a window or neighborhood around the cutoff and use units whose scores are below the cutoff as a comparison group for treated units whose scores are above it, it is natural to imagine that treated units with Ri = r0 + ε could have instead received a score of Ri = r0 − ε and thus could have been assigned to the control group. It therefore seems plausible to assume that the treatment assignment in a small window around the cutoff is probabilistic, and it is probabilistic by virtue of the unpredictable components of R, in combination with the external RD rule T = 1(R ≥ r0). This implies that the RD design satisfies the definition of a natural experiment that I have proposed. My conclusion concurs with DiNardo’s (2016) and Dunning’s (2012) characterizations of the RD design as a natural experiment, but for different reasons. While these authors see the RD design as akin to an experiment, my understanding of the RD design as a natural experiment stems from its status as an observational study where an external rule justifies the assumption that the treatment assignment is probabilistic. Understanding RD designs as natural experiments in the sense of Definition NE separates the notion of chance from the notion of comparability: the probabilistic nature of the RD assignment implies neither that the RD assignment mechanism is knowable nor that it is equiprobable.
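To make the local randomization reading concrete, the following minimal sketch (not part of the original chapter) simulates a score, applies the deterministic rule T = 1(R ≥ r0), and compares treated and control outcomes within a small window around the cutoff. Every number and distribution in the sketch is hypothetical; in particular, the simulated score distribution stands in for the distribution that, in a real application, the researcher does not know.

```python
import numpy as np
from scipy import stats

# Minimal simulation of the RD setup: each unit receives a score R (e.g., a
# party's vote share), and treatment follows the deterministic rule
# T = 1(R >= r0). The score distribution is simulated here only so that
# there is something concrete to compute with.
rng = np.random.default_rng(0)
n = 5_000
r0 = 0.5                                # known cutoff
score = rng.beta(8, 8, size=n)          # score distribution, unknown in practice
treated = (score >= r0).astype(int)     # RD assignment rule T = 1(R >= r0)

# Outcomes that depend smoothly on the score plus a constant treatment effect.
outcome = 0.3 * score + 0.15 * treated + rng.normal(0, 0.1, size=n)

# Local randomization approach: compare treated and control units inside a
# small window [r0 - w, r0 + w], treating that window as if it were an
# experiment with unknown assignment probabilities.
w = 0.02
in_window = np.abs(score - r0) <= w
y_t = outcome[in_window & (treated == 1)]
y_c = outcome[in_window & (treated == 0)]
t_stat, p_val = stats.ttest_ind(y_t, y_c)
print(f"units in window: {in_window.sum()}, "
      f"difference in means: {y_t.mean() - y_c.mean():.3f}, p = {p_val:.3f}")
```

Formal treatments of window selection and inference under this local randomization approach are given in the works cited above (Cattaneo et al. 2015, 2017, 2020a, 2020b).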

6.5 Advantages of Natural Experiments over Traditional Observational Studies

A natural experiment is fundamentally different from a RCE because its treatment assignment mechanism is unknown and unknowable to the researcher. For this reason, in the hierarchy of credibility of research designs for program evaluation, natural experiments rank below RCEs.10 At the same time, the most convincing natural experiments rank above other observational studies where the assignment mechanism is not known to depend on a verifiable external factor. The reason for this is that natural experiments, by virtue of the assignment’s dependence on this external factor, offer clear guidelines to distinguish a pretreatment from a post-treatment period. Moreover, in some cases, the external factor in natural experiments offers a plausible claim of unconfoundedness. I shall refer to an observational study where no external factor is known to affect treatment assignment as a traditional observational study. As an example of such a study, I consider the influential analysis of the determinants of political participation by Brady et al. (1995). These authors propose a resource theory of political participation that expands the traditional socioeconomic status (SES) model that focused on income and education as determinants of political participation. Their expanded model is centered on three types of resources: time, money, and civic skills. Their hypothesis is that the amount of each of these three resources available to an individual has a positive effect on that individual’s political participation. The data come from a representative telephone survey of the US adult population that collected self-reported data on respondents’ political and civic participation and also demographic and economic characteristics.

10 Deaton and Cartwright (2018) (and see also Deaton 2010, 2020) reject the idea that research designs can be ranked in terms of credibility. In response, Imbens (2010) argues that such a ranking is possible in a ceteris paribus sense (see also Imbens 2018).

Both the outcome (political participation) and the treatments of interest (time, money, and civic skills) are measured with data from this survey. In particular, money resources are measured as self-reported family income; civic skills are measured with educational attainment questions, a vocabulary test, and self-reported participation in nonpolitical organizations such as churches and schools; and time is measured as the hours left in an average day after subtracting time spent sleeping, working, studying, and doing household work. A comparison of this traditional observational study with the natural experiment by Lassen (2005) offers important lessons. The assignment mechanism is unknown in both cases. Similarly to the Lassen (2005) study, where the probability of each possible allocation of districts to the decentralization condition is unknown, in the Brady et al. (1995) study we do not know the probability that each individual will receive a given endowment of money, education, language ability, and free time. There is, however, a fundamental difference. In Lassen (2005), the allocation of districts to the decentralization intervention was the result of a governmental policy. This policy was decided by a third party, not by the districts themselves (though we cannot rule out that districts had some influence in determining their own assignment). Moreover, the external mechanism that decided the allocation of districts has a time stamp and is verifiable. These two features apply to natural experiments generally and translate into two concrete advantages over traditional observational studies. The time stamp allows the researcher to identify a pretreatment period and distinguish it from the posttreatment period. And the verifiability of the external mechanism can, in some cases, justify an unconfoundedness assumption. I discuss both issues below.

6.5.1 Pretreatment Period and Falsification

In a natural experiment, the assignment mechanism depends on an external factor. As argued at length above, knowledge of

this external factor is not sufficient to fully know the probability distribution of the assignment. However, because the occurrence of the external factor is a necessary condition for the treatment to be assigned, the time period when the external event occurs serves as a natural delimiter. Unlike traditional observational studies, natural experiments allow the researcher to establish objectively the time period when treatment assignment occurs, because he or she can record when the external intervention was initiated. This treatment assignment time stamp is crucial for falsification purposes. Once the researcher collects information about the moment when the treatment was given to the units, the periods before and after the treatment assignment are easily established – the period before the treatment is commonly referred to as the pretreatment period. An important falsification analysis is available if researchers can collect information on a set of covariates X measured during the pretreatment period. By virtue of having been measured in this period, these variables will be determined before the treatment is assigned, and thus the effect of treatment on them is zero by construction. Thus, the variables X can be used to implement a falsification analysis that is common in the analysis of randomized experiments: by analyzing whether the treatment has in fact no effect on the covariates, researchers can offer empirical evidence regarding the comparability of treated and control groups. As in randomized experiments, the usefulness of this so-called “covariate balance” analysis depends on the type of variables that are included in X. The most convincing falsification analysis will be one where these variables are strongly correlated with both the outcome and the factors that affect the propensity to receive the externally assigned treatment.11 On this aspect, natural

11 For example, in the Lassen (2005) study, one could analyze the share of the population that is college educated, which is known to correlate with voter turnout (the outcome of interest), and is also correlated with socioeconomic indicators such as income and poverty that might make decentralization (the treatment) more or less desirable.
12 An equiprobable assignment is one in which every unit has the same probability of receiving treatment, but not necessarily one in which this probability is equal to 50%. As long as this probability is constant for all units, the distribution of covariates in the treatment and control group will be the same.

experiments do not differ much from RCEs and RTPEs. However, there is one crucial difference. The correct implementation of a covariate falsification analysis depends on the assignment mechanism, which in a natural experiment is unknown. When the assignment mechanism is equiprobable, the distribution of X is expected to be the same when the entire control group is compared to the entire treatment group, and thus the falsification test can be implemented with unadjusted covariate balance tests that compare all treated units versus all control units.12 However, if the treatment assignment probabilities are different for different subgroups of units, the proper implementation of a covariate balance test requires us to weight or stratify the analysis based on these probabilities. In natural experiments, however, these probabilities are unknown, so such adjustment is unavailable. This suggests that an unadjusted covariate balance test is a useful tool to establish the plausibility of the equiprobable assignment assumption. For implementation, researchers assume that the assignment mechanism is equiprobable and test the implication that the unadjusted distribution of X is equal in the treatment and control groups. If the hypothesis that the treated and control covariate distributions are equal is rejected, then the assumption of equiprobable assignment is unsupported by the data. This is an important first step toward gaining a deeper understanding of the assignment mechanism. The implementation of this falsification analysis is straightforward in the Lassen (2005) study. The decentralization intervention occurred in 1995 when the Copenhagen Municipality Structural Commission selected the districts that would be decentralized. Thus, all district-level variables collected

before 1995 are pretreatment and can be used in a falsification analysis. These could include census counts, economic indicators, etc. In contrast, in the Brady et al. (1995) study, the pretreatment period is impossible to identify with certainty because it is unclear when the treatments of time, money, and civic skills are in fact assigned. For example, if an individual reports high levels of civic skills as measured by a vocabulary test, what exactly is the period before these skills were developed? We know that language skills are susceptible to stimulation from an early age, and toddlers and even infants who are exposed to rich language environments have stronger language skills. We cannot rule out that people with high vocabulary skills have been exposed to this treatment since early childhood. A similar argument can be applied to the money and time treatments. This implies that a pretreatment period is unavailable and all covariates are in fact posttreatment covariates. Therefore, there are no covariates available with which to implement a falsification analysis.

6.5.2 Verifiability of Externality and Unconfoundedness

When the empirical evidence shows that the distribution of relevant predetermined covariates differs between the treatment and the control group, the assumption of equiprobable assignment is implausible. This means that the data do not support the assumption that all units were assigned to the treatment condition with the same probability. Without additional knowledge, it is not possible to identify causal treatment effects in a design-based fashion. However, the most convincing natural experiments might offer a reasonable justification for the assumption that the assignment is unconfounded given some observable predetermined covariates. The credibility of this justification is based directly on the externality of the treatment assignment that characterizes natural experiments. As I have defined it, a natural experiment is a setting in which the treatment assignment mechanism is known to be probabilistic

by virtue of it depending on an external factor. In some natural experiments, the researcher has enough information about the variables on which the external intervention depended. In these cases, the researcher might credibly assume that, after these variables are conditioned on, the probability of treatment assignment is not a function of the units’ potential outcomes. The credibility of such an assumption should be judged on a case-by-case basis. For example, in the Lassen (2005) study, the exact functional form of the assignment mechanism is unknown, but a qualitative investigation of the decision-making process revealed that the decision of which districts to decentralize was based on the districts’ populations and levels of socioeconomic development, with the explicit goal of ensuring that the decentralized districts were as diverse as the total population of districts in terms of these covariates. This feature of the assignment, which is directly verifiable with qualitative information issued by the Copenhagen Municipal Commission, can be used as the basis for the unconfoundedness assumption that the probability of decentralization is unrelated to the districts’ potential outcomes after conditioning on population and socioeconomic development. Note an important difference between the unconfoundedness assumption and the equiprobable assignment assumption: the latter is empirically testable, but the former is not. Because covariate balance is an implication of an equiprobable assignment mechanism, we can use covariate balance tests to falsify the assumption that the assignment mechanism is equiprobable. However, in the absence of additional assumptions, the unconfoundedness assumption is fundamentally untestable. This means that a justification for it has to rely more heavily on the qualitative information about the assignment mechanism and stands on weaker evidentiary ground. The assumption of unconfounded assignment is always strong, but in this respect natural experiments have an advantage over traditional observational studies: the dependence of the assignment mechanism

on external factors is verifiable. In the most convincing natural experiments, researchers are able to verify that the process that governed the treatment assignment depended on external factors, and prior scientific knowledge coupled with qualitative and/or quantitative data suggest that treatment assignment should be unrelated to potential outcomes conditional on those factors. If the researcher is able to collect information on those same factors and condition on them in the analysis, then the usual tools of program evaluation based on unconfoundedness – parametric adjustment models, propensity score analyses, matching estimators, etc. – are available for analysis. Thus, in a convincing natural experiment, the researcher uses the available information on the external assignment mechanism as a plausible basis to invoke an unconfoundedness assumption. This is unavailable in a traditional observational study, where there is usually no objective basis to claim that unconfoundedness holds for any set of covariates, given that we fundamentally do not know how (and when) the treatment was assigned. For example, in the Brady et al. (1995) study, what covariates should we condition on before we can assume that people with high levels of money, time, and civic resources are comparable to people who have low levels of those resources? Brady et al. condition on citizenship status because they reasonably assume that it is a “prerequisite for voting and might affect other kinds of participation as well.” But, even putting aside the concerns about establishing the pretreatment period, we can imagine many other factors such as geographic location, number of children, parents’ education, etc., that may affect both the propensity to participate in politics and the amount of time, money, and resources available to an individual. There is no objective information to guide the choice of the conditioning set. The decision to participate in politics, since it is made privately and is entirely under the control of each individual, is less transparent to the researcher than the decision to decentralize districts in Copenhagen. Unlike the Copenhagen

Municipal Commission, which published a report on the decentralization process, individual citizens do not write reports detailing the process by which they arrived at the decision to participate in politics. This greater transparency about the assignment mechanism, and the separation between the units receiving the treatment and those assigning it, can imbue some natural experiments with a stronger research design and a more objective basis to invoke the necessary identification assumptions. My choice of “can” in the prior sentence is deliberate and should not be replaced by “do.” I do not mean to claim that the “worst” natural experiment is always preferable to the “best” traditional observational study. Some natural experiments blatantly violate the equiprobable assignment assumption and provide a very weak basis for assuming unconfoundedness. Some traditional observational studies are carefully conducted and genuinely contribute to our scientific knowledge. My claims about a credibility hierarchy are made in the ceteris paribus spirit articulated by Imbens (2010) – in a given study, it is preferable to have a verifiable conditioning set and a clear time stamp attached to the treatment assignment, and no researcher would willingly give up such information.
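The two empirical steps discussed in Sections 6.5.1 and 6.5.2 can be sketched in code: an unadjusted covariate balance (falsification) test, followed, when balance fails, by an adjustment based on the covariates that the external assignment is known to have used. The sketch below is illustrative only: the data are simulated, the column names (population, ses_index, turnout) are hypothetical placeholders rather than the actual Lassen (2005) variables, and inverse probability weighting is just one of the standard unconfoundedness tools the text mentions.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LogisticRegression

# Hypothetical district-level data; all names and values are placeholders.
rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "population": rng.lognormal(mean=10, sigma=0.5, size=n),
    "ses_index": rng.normal(0, 1, size=n),
})
# An external assignment that depends on the covariates: probabilistic for
# every district, but not equiprobable.
logit = 0.5 * df["ses_index"] + 0.3 * np.log(df["population"]) - 3
df["treated"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))
df["turnout"] = 50 + 2 * df["ses_index"] + 5 * df["treated"] + rng.normal(0, 5, size=n)

is_t = df["treated"].to_numpy() == 1

# Step 1 (Section 6.5.1): unadjusted balance tests on pretreatment covariates.
# Rejection is evidence against the equiprobable assignment assumption.
for cov in ["population", "ses_index"]:
    _, p = stats.ttest_ind(df.loc[is_t, cov], df.loc[~is_t, cov])
    print(f"unadjusted balance test, {cov}: p = {p:.3f}")

# Step 2 (Section 6.5.2): if balance fails, condition on the covariates that
# the external assignment is known to have used, here via inverse probability
# weighting on an estimated propensity score.
X = np.column_stack([np.log(df["population"]), df["ses_index"]])
pscore = LogisticRegression().fit(X, df["treated"]).predict_proba(X)[:, 1]
w = np.where(is_t, 1 / pscore, 1 / (1 - pscore))
adj_diff = (np.average(df.loc[is_t, "turnout"], weights=w[is_t])
            - np.average(df.loc[~is_t, "turnout"], weights=w[~is_t]))
print(f"IPW-adjusted difference in turnout: {adj_diff:.2f}")
```

The weighting step stands in for any of the adjustment strategies the chapter lists (parametric models, propensity score analyses, matching estimators), and is only as credible as the verifiable information justifying the conditioning set.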

6.6 Recommendations for Practice

The preceding discussion suggests some general recommendations for empirical researchers who wish to estimate and interpret causal effects based on natural experiments.

6.6.1 Is the Assignment Probabilistic?

The first step is to establish whether the assumption of a probabilistic assignment is met for the universe of units that the researcher wishes to analyze. As I have defined it, a crucial feature of a natural experiment is that its assignment mechanism is probabilistic by virtue of an external event that is outside of the units’ direct control. The researcher should establish whether

it is in fact the case that all units to be included in the study had a probability of receiving treatment strictly between zero and one. If some units were certain to either be affected or not affected by the intervention, they should be excluded from the study, as the usual causal parameters will not be identifiable. If some units are excluded, the researcher should redefine the parameter of interest and clarify in the analysis that the reported estimates capture the effect of the intervention only for units whose probability of being treated was neither zero nor one. The researcher should carefully characterize this new parameter. The caveat is that the assumption of probabilistic treatment assignment is not directly verifiable or testable, because untreated units could be untreated either because their ex ante treatment assignment probability is zero or because it is positive but the realization of the assignment is the control condition. For this reason, researchers should use prior scientific knowledge and/or qualitative and quantitative information regarding the external process that assigned the treatment to justify the probabilistic assignment assumption.

6.6.2 Is the Assignment Equiprobable?

The second step is to assume that the assignment mechanism is equiprobable and to test the implication that the distribution of relevant pretreatment covariates is equal in the treatment and control groups. This falsification analysis starts by selecting a group of relevant pretreatment covariates X and testing the null hypothesis that the means and other features of the distribution of these covariates are the same in the treated and control groups. If the hypothesis of covariate balance is not rejected, the analysis can proceed under the equiprobable assignment assumption using standard tools from the analysis of randomized experiments (e.g., Athey and Imbens 2017; Gerber and Green 2012; Imbens and Rubin 2015) – with the caveat that in natural experiments, unlike in RCEs or RTPEs, this assumption is not known to be true and its credibility might be

disputed by other analyses. If the hypothesis of covariate balance is rejected, then the assumption of equiprobable assignment is unsupported by the data. Of course, researchers should ensure that their tests have enough statistical power to avoid mistakenly interpreting the failure to reject a false null hypothesis of covariate balance as supportive of the equiprobable assignment assumption.

6.6.3 Is the Assignment Unconfounded?

An assignment mechanism that is not equiprobable could still be unconfounded. When the data do not support the assumption of equiprobable assignment, researchers should explore whether it is plausible to assume that there exists a covariate-based adjustment that renders the treated and control groups comparable. In this second stage of falsification, researchers can use the external assignment mechanism of the natural experiment to offer a plausible basis to adopt an unconfoundedness assumption. This justification should be based on objective and verifiable information about the treatment assignment mechanism that identifies a set of covariates that were explicitly used in the assignment, as in the Lassen study. Assuming that the researcher has access to these covariates, the analysis can proceed under the assumption of unconfoundedness given these covariates using standard estimation and inference methods from the unconfoundedness toolkit (e.g., Abadie and Cattaneo 2018; Imbens and Rubin 2015) – again, with the caveat that this assumption is not known to be true and might be disputed by later analyses.

6.6.4 Is the Natural Experiment of Substantive Interest?

In most natural experiments, the treatment that is assigned is not exactly the treatment that a researcher would have assigned if he or she had been in charge of the execution of the study. This leads to very important and often difficult issues of interpretation. Even if all of the required identification

assumptions are satisfied, the treatment effect that is identifiable by the design may not be the effect of scientific interest. Sekhon and Titiunik (2012) illustrate this point with a redistricting natural experiment. Several researchers have used the periodic redrawing of legislative district boundaries in the US to study the incumbency advantage, comparing the vote share received by the same incumbent legislator in areas that are new to his or her district versus areas that have been part of the district for a long time. Even if precincts were randomly moved to new districts according to a known probability distribution, this assignment would never achieve comparability between new and old voter areas in terms of their prior history (e.g., party or race of prior incumbent), because new voters are coming from a different incumbent by construction. In terms of the prior discussion, this occurs because the probability of old voters being selected as new voters is zero, and thus the overall assignment is not probabilistic for this population. The natural experiment externally introduces variation in the voters that an incumbent receives in his or her district. Whether this variation is useful for studying the incumbency advantage of interest to scholars of American politics is a separate matter. Such issues of interpretation should be at the forefront of any analysis based on natural experiments.
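Returning to the caveat in Section 6.6.2 about statistical power: one simple way to assess whether a non-rejection in the balance test is informative is a short simulation of the test’s rejection rate against a specific alternative. The sketch below is illustrative only; the group sizes and the benchmark imbalance are hypothetical and should be replaced by the values relevant to a given study.

```python
import numpy as np
from scipy import stats

# Simulation-based power check for the covariate balance test recommended in
# Section 6.6.2: how often would a two-sample t-test detect a given
# standardized imbalance at the available sample sizes?
rng = np.random.default_rng(2)
n_treat, n_control = 60, 120      # hypothetical group sizes
true_imbalance = 0.3              # benchmark imbalance, in standard-deviation units
alpha = 0.05
n_sims = 2_000

rejections = 0
for _ in range(n_sims):
    x_t = rng.normal(true_imbalance, 1, n_treat)
    x_c = rng.normal(0.0, 1, n_control)
    _, p = stats.ttest_ind(x_t, x_c)
    rejections += p < alpha

power = rejections / n_sims
print(f"power to detect a {true_imbalance} SD imbalance: {power:.2f}")
# If this power is low, failing to reject says little in favor of the
# equiprobable assignment assumption.
```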

6.7 Conclusion

The literature has offered several definitions of a natural experiment, not necessarily consistent with one another. I sought to partly resolve the ambiguity by going back to the definition of a RCE, and contrasting the canonical natural experiment to it. As I have defined it, a natural experiment is a study in which the treatment assignment mechanism is neither designed nor implemented by the researcher, is unknown and unknowable to the researcher, and is probabilistic by means of an external event or intervention that is outside of the control of the units who are the subjects of the intervention.

In order to arrive at this definition, I have emphasized several conceptual distinctions. A central conclusion is that a RCE’s defining feature is not that the treatment assignment is random, in the sense of being a random variable with some distribution, because this would imply that all interventions, programs, and individual decisions ever taken are randomized experiments. That a citizen’s decision to vote is a random variable does not imply that a comparison of voters and nonvoters is a randomized experiment. The key is not that the treatment must have a distribution (all random variables do), but rather that the experimenter must know what this distribution is. The power of a RCE (and a RTPE) is therefore not only in the randomization itself, but also in the knowledge and properties of the assignment distribution that the randomization implies. In a RCE, the unconfoundedness assumption guaranteed by the physical randomization device is as crucial as the ex ante unpredictability of each individual’s treatment assignment. In contrast, natural experiments retain the unpredictability, but discard knowledge of the assignment mechanism and the unconfoundedness guarantees. Because natural experiments have, by definition, a treatment assignment mechanism that is unknowable to the researcher, they rank – everything else equal – unambiguously below RCEs in terms of credibility and reproducibility. Nonetheless, natural experiments offer two important advantages over traditional observational studies. First, by defining the moment when the intervention of interest occurs, they clearly demark a pretreatment period, which is essential to falsify the assumption of equiprobable assignment and also to condition on covariates in a valid way. Second, in cases where the equiprobable assignment assumption does not hold, the best natural experiments offer a plausible and verifiable justification for an unconfoundedness assumption. Both the time stamp that delimits pre- and post-treatment periods and the objective justification for the unconfoundedness assumption are often lacking in traditional observational studies.

References
Abadie, Alberto, and Matias D. Cattaneo. 2018. “Econometric methods for program evaluation.” Annual Review of Economics 10: 465–503.
Abdulkadiroğlu, Atila, Joshua D. Angrist, Yusuke Narita, and Parag A. Pathak. 2017. “Research design meets market design: Using centralized assignment for impact evaluation.” Econometrica 85(5): 1373–1432.
Aldous, David, and Persi Diaconis. 1986. “Shuffling cards and stopping times.” American Mathematical Monthly 93(5): 333–348.
Angrist, Joshua D. 1990. “Lifetime earnings and the Vietnam era draft lottery: Evidence from social security administrative records.” American Economic Review 80(3): 313–336.
Angrist, Joshua D., and Alan B. Krueger. 1991. “Does compulsory school attendance affect schooling and earnings?” Quarterly Journal of Economics 106(4): 979–1014.
Angrist, Joshua D., Guido W. Imbens, and Donald B. Rubin. 1996. “Identification of causal effects using instrumental variables.” Journal of the American Statistical Association 91(434): 444–455.
Angrist, Joshua D., and Alan B. Krueger. 2001. “Instrumental variables and the search for identification: From supply and demand to natural experiments.” Journal of Economic Perspectives 15(4): 69–85.
Athey, Susan, and Guido W. Imbens. 2017. “The econometrics of randomized experiments.” In Handbook of Economic Field Experiments. Vol. 1. Amsterdam: Elsevier, pp. 73–140.
Berger, James O. 1990. Randomization. London: Palgrave Macmillan UK, pp. 208–210.
Bhavnani, Rikhil R. 2009. “Do electoral quotas work after they are withdrawn? Evidence from a natural experiment in India.” American Political Science Review 103(1): 23–35.
Brady, Henry E., Sidney Verba, and Kay Lehman Schlozman. 1995. “Beyond SES: A resource model of political participation.” American Political Science Review 89(2): 271–294.
Carbone, Jared C., Daniel G. Hallstrom, and V. Kerry Smith. 2006. “Can natural experiments measure behavioral responses to environmental risks?” Environmental and Resource Economics 33(3): 273–297.
Card, David, and Alan Krueger. 1994. “Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania.” American Economic Review 84(4): 772–793.
Cattaneo, Matias D., Brigham Frandsen, and Rocio Titiunik. 2015. “Randomization inference in the regression discontinuity design: An application to party advantages in the U.S. Senate.” Journal of Causal Inference 3(1): 1–24.
Cattaneo, Matias D., Nicolás Idrobo, and Rocío Titiunik. 2020a. A Practical Introduction to Regression Discontinuity Designs: Foundations. Cambridge Elements: Quantitative and Computational Methods for Social Science. Cambridge, UK: Cambridge University Press.
Cattaneo, Matias D., Nicolás Idrobo, and Rocío Titiunik. 2020b. A Practical Introduction to Regression Discontinuity Designs: Extensions. Cambridge Elements: Quantitative and Computational Methods for Social Science. Cambridge, UK: Cambridge University Press, in preparation.
Cattaneo, Matias D., Rocio Titiunik, and Gonzalo Vazquez-Bare. 2017. “Comparing inference approaches for RD designs: A reexamination of the effect of head start on child mortality.” Journal of Policy Analysis and Management 36(3): 643–681.
Cattaneo, Matias D., Rocio Titiunik, and Gonzalo Vazquez-Bare. 2020c. “The regression discontinuity design.” In The SAGE Handbook of Research Methods in Political Science and International Relations. Thousand Oaks, CA: SAGE Publishing, pp. 835–857.
Clayton, Amanda. 2015. “Women’s political engagement under quota-mandated female representation: Evidence from a randomized policy experiment.” Comparative Political Studies 48(3): 333–369.
Cook, Thomas D., and Donald T. Campbell. 1979. The Design and Conduct of True Experiments and Quasi-Experiments in Field Settings. Boston, MA: Houghton Mifflin.
Cook, Thomas D., Donald Thomas Campbell, and William Shadish. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston, MA: Houghton Mifflin.
Craig, Peter, Srinivasa Vittal Katikireddi, Alastair Leyland, and Frank Popham. 2017. “Natural experiments: An overview of methods, approaches, and contributions to public health intervention research.” Annual Review of Public Health 38: 39–56.
Deaton, Angus. 2010. “Instruments, randomization, and learning about development.” Journal of Economic Literature 48: 424–455.
Deaton, Angus. 2020. “Randomization in the tropics revisited: A theme and eleven variations.” In Randomized Controlled Trials in the Field of Development: A Critical Perspective, eds. Florent Bédécarrats, Isabelle Guérin, and François Roubaud. Oxford: Oxford University Press, Revised.
Deaton, Angus, and Nancy Cartwright. 2018. “Understanding and misunderstanding randomized controlled trials.” Social Science & Medicine 210: 2–21.
DiNardo, John. 2016. “Natural experiments and quasi-natural experiments.” The New Palgrave Dictionary of Economics. Berlin: Springer, pp. 1–12.
Downey, Rodney G., and Denis R. Hirschfeldt. 2010. Algorithmic Randomness and Complexity. Berlin: Springer Science + Business Media.
Dunning, Thad. 2008. “Improving causal inference: Strengths and limitations of natural experiments.” Political Research Quarterly 61(2): 282–293.
Dunning, Thad. 2012. Natural Experiments in the Social Sciences: A Design-Based Approach. Cambridge, UK: Cambridge University Press.
Erikson, Robert S., and Laura Stoker. 2011. “Caught in the draft: The effects of Vietnam draft lottery status on political attitudes.” American Political Science Review 105(2): 221–237.
Fienberg, Stephen E. 1971. “Randomization and social affairs: The 1970 draft lottery.” Science 171(3968): 255–261.
Fisher, Ronald A. 1935. “The design of experiments.” In Statistical Methods, Experimental Design, and Scientific Inference: A Re-issue of Statistical Methods for Research Workers, The Design of Experiments, and Statistical Methods and Scientific Inference, ed. J. H. Bennett. Oxford: Oxford University Press, 1st edition, 1990.
Fisher, Ronald A. 1956. “Statistical methods and scientific inference.” In Statistical Methods, Experimental Design, and Scientific Inference: A Re-issue of Statistical Methods for Research Workers, The Design of Experiments, and Statistical Methods and Scientific Inference, ed. J. H. Bennett. Oxford: Oxford University Press, 1st edition, 1990.
Fuchs-Schündeln, Nicola, and Tarek Alexander Hassan. 2016. “Natural experiments in macroeconomics.” In Handbook of Macroeconomics. Vol. 2. Amsterdam: Elsevier, pp. 923–1012.
Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: W. W. Norton.
Gould, Eric D., Victor Lavy, and M. Daniele Paserman. 2004. “Immigrating to opportunity: Estimating the effect of school quality using a natural experiment on Ethiopians in Israel.” Quarterly Journal of Economics 119(2): 489–526.
Heckman, James J., Hidehiko Ichimura, and Petra Todd. 1998. “Matching as an econometric evaluation estimator.” Review of Economic Studies 65(2): 261–294.
Holland, Paul W. 1986. “Statistics and causal inference.” Journal of the American Statistical Association 81(396): 945–960.
Imbens, Guido. 2018. “Understanding and misunderstanding randomized controlled trials: A commentary on Deaton and Cartwright.” Social Science & Medicine 210: 50–52.
Imbens, Guido W. 2010. “Better LATE than nothing: Some comments on Deaton (2009) and Heckman and Urzua (2009).” Journal of Economic Literature 48(2): 399–423.
Imbens, Guido W., and Donald B. Rubin. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge, UK: Cambridge University Press.
Imbens, Guido W., and Jeffrey M. Wooldridge. 2009. “Recent developments in the econometrics of program evaluation.” Journal of Economic Literature 47(1): 5–86.
Keynes, John Maynard. 1921. A Treatise on Probability. London: Macmillan & Co.
Knight, Frank H. 1921. Risk, Uncertainty and Profit. Chelmsford, MA: Courier Corporation.
Lalive, Rafael, and Josef Zweimüller. 2009. “How does parental leave affect fertility and return to work? Evidence from two natural experiments.” Quarterly Journal of Economics 124(3): 1363–1402.
Lassen, David Dreyer. 2005. “The effect of information on voter turnout: Evidence from a natural experiment.” American Journal of Political Science 49(1): 103–118.
Lee, David S. 2008. “Randomized experiments from non-random selection in U.S. House elections.” Journal of Econometrics 142(2): 675–697.
Lee, David S., and Thomas Lemieux. 2010. “Regression discontinuity designs in economics.” Journal of Economic Literature 48(2): 281–355.
Lehmer, Derrick H. 1951. “Mathematical methods in large-scale computing units.” Annals of Computation Laboratory of Harvard University 26: 141–146.
LeRoy, Stephen F., and Larry D. Singell Jr. 1987. “Knight on risk and uncertainty.” Journal of Political Economy 95(2): 394–406.
Merriam-Webster, Online Dictionary. 2015. Springfield, MA: Merriam-Webster, Inc.
Meyer, Bruce D. 1995. “Natural and quasi-experiments in economics.” Journal of Business & Economic Statistics 13(2): 151–161.
Miguel, Edward, Shanker Satyanath, and Ernest Sergenti. 2004. “Economic shocks and civil conflict: An instrumental variables approach.” Journal of Political Economy 112(4): 725–753.
Neyman, Jerzy. 1923 [1990]. “On the application of probability theory to agricultural experiments. Essay on principles. Section 9.” Statistical Science 5(4): 465–472.
Park, Stephen K., and Keith W. Miller. 1988. “Random number generators: Good ones are hard to find.” Communications of the ACM 31(10): 1192–1201.
Petticrew, Mark, Steven Cummins, Catherine Ferrell, Anne Findlay, Cassie Higgins, Caroline Hoy, Adrian Kearns, and Leigh Sparks. 2005. “Natural experiments: An underused tool for public health?” Public Health 119(9): 751–757.
Pincus, Steve, and Burton H. Singer. 1996. “Randomness and degrees of irregularity.” Proceedings of the National Academy of Sciences of the United States of America 93(5): 2083–2088.
Pincus, Steve, and Rudolf E. Kalman. 1997. “Not all (possibly) ‘random’ sequences are created equal.” Proceedings of the National Academy of Sciences of the United States of America 94(8): 3513–3518.
Rosenbaum, Paul R. 2002. Observational Studies. Berlin: Springer.
Rosenzweig, Mark R., and Kenneth I. Wolpin. 2000. “Natural ‘natural experiments’ in economics.” Journal of Economic Literature 38(4): 827–874.
Rubin, Donald B. 1974. “Estimating causal effects of treatments in randomized and nonrandomized studies.” Journal of Educational Psychology 66(5): 688.
Rutter, Michael. 2007. “Proceeding from observed correlation to causal inference: The use of natural experiments.” Perspectives on Psychological Science 2(4): 377–395.
Sekhon, Jasjeet S., and Rocio Titiunik. 2012. “When natural experiments are neither natural nor experiments.” American Political Science Review 106(1): 35–57.
Sekhon, Jasjeet S., and Rocío Titiunik. 2017. “On interpreting the regression discontinuity design as a local experiment.” In Regression Discontinuity Designs: Theory and Applications (Advances in Econometrics, volume 38), eds. Matias D. Cattaneo and Juan Carlos Escanciano. Bingley: Emerald Group Publishing, pp. 1–28.
Student. 1931. “The Lanarkshire milk experiment.” Biometrika 23(3/4): 398–406.
Titiunik, Rocio. 2016. “Drawing your senator from a jar: Term length and legislative behavior.” Political Science Research and Methods 4(2): 293–316.

CHAPTER 7

Virtual Consent: The Bronze Standard for Experimental Ethics∗

Dawn Langan Teele

Abstract

Informed consent has been a mainstay of all ethical research guidelines since the 1970s, but the proliferation of field experiments in the social sciences – which include audit experiments, correspondence experiments, canvassing experiments, social media experiments, and information experiments – has brought with it an increasing resistance to procuring informed consent. This chapter grapples with the now common practice of denying research subjects an opportunity to voluntarily consent to participate in research. It provides a framework for thinking about virtual consent, a situation in which the researcher consents for participants. Drawing on a Rawlsian thought experiment, I argue that ethical research is that to which a reasonable person, not knowing whether he or she would be the subject or the scientist, would consent. This type of reasoning provides a way for thinking about potential downstream consequences not just for the individual subject, but also for society writ large. Yet, because virtual consent does not entail voluntary participation, it constitutes a bronze standard, rather than a best practice, for ethics in experiments.

* For their probing comments, I thank James N. Druckman, Donald P. Green, Josh Simon, Anna Jurkeviks, Tara Slough, and Hannah Nam.

A much-discussed documentary, Three Identical Strangers, details the research design, scientific inspiration, and public response to an experiment carried out on orphaned triplets beginning in 1961

in the USA. The original experiment, conceived of and executed by two psychiatrists, intentionally placed the six-month-old triplets in homes with different socioeconomic statuses and parenting styles. It was designed to understand the differential impact of environmental versus biological factors on health, educational attainment, and personality of the siblings

thereafter.1 The importance of the debate between nature and nurture can hardly be disputed; whether we are interested in issues related to the development of personality, the likelihood of contracting specific diseases, including mental illness, or existential questions about the meaning of life, understanding the differential roles of biological or environmental factors is important. Because siblings and twins share considerable (and possibly identical) genetic material, this study provides an interesting opportunity to trace long-term outcomes while holding certain genetic factors constant. What is more, since the triplets were allocated to homes that varied on several important factors, genetic differences between siblings, such as phenotype or aptitude, did not impact which families they ended up with, diminishing a concern that there was selection bias into adoptive households. The study’s compelling design, and the potential import of the findings, give it the markings of potentially transformative science. But was it ethical? For nearly five decades, experimental ethics, as with research ethics more broadly, have been guided by a set of criteria known as the Belmont Report. The Report’s three principles – beneficence, respect for persons, and justice – were elaborated in response to public outcry over certain types of research, including experiments carried out coercively on Jewish populations by the Nazis and experiments by American physician scientists carried out on Black men with syphilis. In the famous Tuskegee syphilis study, research continued even after the scientists knew that they could cure the disease using antibiotics. The white male Tuskegee researchers were critiqued for withholding relevant information from research subjects, for deceiving subjects about the nature of the research, and because their study

population consisted of a group of sick African American men who had little education or agency to advocate for themselves against the physicians’ treatment (Jones 1981). The Belmont principles stipulate that participation in research should always be voluntary; that researchers should always secure informed consent from participants (where “informed” implies adequately briefed about potential risks); and that researchers should have to ask and answer the question of whether the findings of the study are relevant to the population on which the study is carried out to ensure that they are not burdening the poor, the old, the very young, or the weak, unless absolutely necessary for the scientific findings. Although the Belmont principles certainly seem to have a natural application to a large set of public policy experiments and are also meant to apply to behavioral research, the principles were designed with medical experiments in mind. As more and more social scientists have joined the experimental bandwagon, and as new forms of experiments have been elaborated, there has been an increasing number of studies in which participation in research is clearly not voluntary and where, by design, participants are not given the opportunity to consent (Desposato 2018), such as when the “treatment” of a study is carried out at a village instead of an individual level and people do not know they are being observed, or when treatments are administered over social media platforms and people do not understand the research purpose behind the memes, prompts, or ads they see.2 Because the suspension of informed consent has become commonplace, leading experimentalists have attempted to articulate new ideas about ethics that circumvent voluntary participation and informed consent. But as yet, no new apparatus has emerged that is clearly superior

1 Most of the findings from this study were not reported, though some appeared in Neubauer and Neubauer (1996). In 2019, after the documentary was released, the Journal of the American Medical Association published commentaries on the ethics of the experiment and the historical practice of separating the twins at birth; see Hoffman and Oppenheim (2019a, 2019b).

2 On the varieties of social media experiments and some of the ethical issues involved in carrying them out, see Chapter 10 in this volume. In that chapter, Guess claims that the simplest and least ethically dubious social media experiments involve collecting outcome measures – like “click-throughs” – that are often less interesting and theoretically relevant than what scholars hope to learn.

to the Belmont Report for thinking about the ethical treatment of human subjects. Several scholars have pointed out that institutional review boards (IRBs) are unlikely to get us out of this pickle, as IRBs are more about protecting institutional liability than they are about ethics per se.3 The need for rethinking ethics for social scientific experiments is nowhere more apparent than on the issue of consent. In medical experiments, bodily integrity reigns supreme: individuals have the right to know about and decide which chemicals enter into their bodies and the treatments that are pursued for the sake of their health. Although generations of scientists and philosophers have made class, racial, and gender distinctions about which individuals have the right to bodily autonomy (and we still deny a fair amount of bodily autonomy to children and incarcerated people), today most people living in democracies would likely agree that scientists and governments should not administer medical interventions to other humans without supplying the maximum amount of information and without procuring voluntary consent.4

3 Klitzman (2015) investigates the structure and internal dynamics of IRBs at research universities. He samples a quarter of the top 240 institutions that receive National Institutes of Health (NIH) funding and then picks half of those for in-depth examination (resulting in 34 IRBs). In spite of a move in 1999 to establish “certified” IRB professionals, Klitzman documents the wide variety in professionalization, knowledge, and tasks across IRBs. He emphasizes an underappreciated set of power dynamics within the IRB, too, documenting how IRB administrators perceive their own vulnerability when interacting with powerful physicians and researchers (chapter 2). King and Sands (2015) also explain that IRBs do not protect the individual researcher from opprobrium. See Bosk and De Vries (2004) for a discussion of the difficulties that qualitative researchers face when trying to get their research approved by these boards.
4 I think most would agree in principle. In practice, it is clear that scientific racism, eugenics, and the medical manipulation of people are not solely things of the long distant past; as Roberts (1997) documents in her study of Black women, as recently as the 1980s, many judges in American courts mandated sterilization for women who were on welfare. In addition, recent public health measures adopted to counter the “antivaccine” movement, including requiring school-age children to acquire measles vaccinations to enter public school, are indications that the American public supports some coercive medical intervention in the name of public health; see Wickenden (2019).

Since the treatment and the potential outcomes are so personal in medical experiments, it is practically self-evident that research participants should not be enrolled in an experimental study without their knowledge. The imperative of informed consent can seem less urgent in many types of social experiments, especially if researchers believe that the experiment poses little threat to the people that are being observed. Take, for example, the growing body of research on bureaucracies and bureaucrats (for a description of this literature, see Chapters 3 and 27 in this volume). Imagine an experiment designed to help us understand whether bureaucracies deliver services in a racially biased manner, where the race of a citizen with whom a given bureaucrat interacts is randomized, the study group is the bureaucracy, and the experimental subject is the bureaucrat, but the population for whom the results have the greatest implications are citizens themselves. In this setting, scholars might argue that informed consent is unnecessary since the bureaucrats are simply engaging in their day-to-day job, where they would presumably interact with citizens of many races, and that the experiment poses no specific risks to the bureaucrats. Minimal risk to the bureaucrat is ensured if no personal information about the bureaucrat is recorded. In this example, the idea of requiring fully informed consent of the bureaucrat may seem both onerous and likely to undermine any research findings. If bureaucrats tend to deliver services in a racially biased manner but are tipped off about the study and become uncharacteristically helpful to all citizens, researchers would wrongly conclude that there is no racial bias in service delivery. The fact that the bureaucrat–citizen study just described, and many other research designs with similar flavors, cannot be carried out under conditions of informed consent of the research subjects has led some of the foremost social science experimentalists to write about a trade-off between informed

consent and measurement (e.g., Humphreys and Weinstein 2009).5 The idea behind this so-called trade-off is that when our ability to “identify” unbiased treatment effects in a study is undermined by informed consent, we have to think about the potential scientific impact of the study and decide whether the question is so important as to merit a suspension of the “respect for persons” maxim of the Belmont Report. In an earlier treatment of this issue, I argued that conceiving of this problem as a trade-off between measurement and ethics misunderstands what ethics are: An ethical dilemma arises when it is impossible to simultaneously meet the demands of two ethical principles, as, for example, when one is confronted with a situation in which lying to a friend is the only way to avoid insulting him. The ethical principles that conflict in this example are not to lie and also to be kind to others. The trade-off created by the Hawthorne effect between satisfying the principle of respect for persons by obtaining informed consent and generating unbiased measurements of causal effects does not have this character. It is more akin to the tradeoff, in criminal justice, between respecting a suspect’s Miranda rights and the prospects of securing evidence that will lead to conviction. In the latter trade-off, one can recognize that there may be some conceivable benefit to denying suspects their rights, while still demanding that the principles embodied in the Miranda rights be satisfied; the problem here is not an ethical dilemma, but one of comporting with an ethical principle or not. (Teele 2014, pp. 125–126)

The hard-line stance that I took above regarding informed consent was based on my reading of the Belmont Report as providing ethical guidance that was both reasonable and good. Yet, something with which I did 5 Connors et al. (2019) argue that there is an ethical trade-off between data sharing (making one’s raw data available to other scholars) and informed consent insofar as telling survey respondents that their responses, stripped of all identifying information, will be publicly available changes outcomes. Respondents performed worse on some questions and evinced social desirability bias when they knew their data would be shared. This problem might be overcome by careful wording of the consent form.

not grapple, but which will be addressed in this chapter, is the fact that there are different ethical principles that can guide behavior in any number of realms, and the Belmont Report provides just one way to conceive of ethics in human subjects research.6 In the pages that follow, I differentiate between three levels of consent based on whether or not participation is voluntary, and the amount of information research subjects have about the potential costs and benefits of the study for their lives. The highest level, the “gold standard,” is one in which participation in research is voluntary and research subjects have full information; under the “silver standard,” participation is voluntary and research subjects have partial information; finally, there is a “bronze standard,” in which research participation is involuntary and research subjects are given no information. An ethics of deliberative decision-making would almost certainly push back against research where participation is involuntary and subjects are uninformed. But the fact is, many experiments currently being fielded do not comply with that standard. To think through an ethics of conduct under the bronze standard, I invoke the political philosopher John Rawls’s method of developing moral insights from behind a “veil of ignorance,” where people do not know whether they are part of a research study and, as a consequence, have no information about the costs, benefits, or risks involved in participation.7 Drawing on Rawls’s thought experiment, I suggest that prior to seeking funding or IRB approval for research projects, scholars should imagine themselves to be behind a veil of ignorance wherein they do not know, ex ante, whether they are the researcher or the research subject. The ethical choice would be to only engage in research projects in which a reasonable person, who was unsure of whether they would be the researcher or the researched, would consent to participate. 6 Whitfield (2019, p. 4) argues that biomedical ethics may not provide an ethics for political science because biomedical ethics is more concerned with individuals, whereas a political ethics might be more concerned with groups. 7 Rawls (2009 [1971], 118ff).

If it is impossible to imagine a reasonable person consenting to being on the subject side, then clearly the research is not ethical and should not be carried out. Importantly, though, we cannot imagine the abstract “reasonable” person solely as someone who shares our own personal preferences and incentives. It is not enough to ask, “Would I, with my priorities and values, consent to this?” Rather, we must think about, and inquire into, whether other people, with values, priorities, and positionalities that are different from our own, might consent to the project.

7.1 Standards of Consent

Most university review boards and governmental associations make reference to the Belmont Report in their guides to ethical conduct in research with human participants. The Belmont Report emerged after the US Department of Health, Education, and Welfare formed a National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research in the 1970s. The three principles that the report enumerates for the protection of human subjects are “respect for persons,” “beneficence,” and “justice.” These correspond roughly to the ideas that: (1) people’s autonomy must be respected, and vulnerable groups protected, by articulating the risks involved in research and procuring informed consent; (2) researchers should not subject participants to more than necessary risks without an immediate prospect of direct benefit; and (3) the burdens of research should be taken on by populations that will benefit most from the findings. In other words, the poor cannot be research subjects simply because they are poor or because it reduces the costs of the research; rather, they need to be potential beneficiaries. Although my initial concerns about the ethics of experiments had to do with research on marginalized populations, such as those living in poverty in poor countries (Teele 2014), most of the ethical discussion extending out of the Belmont Report has centered around the issue of deception and what it means to be “informed” enough to consent to participate in research.8 Due to a phenomenon known as the Hawthorne Effect – also known as the “observer effect” or “experimental demand effects” – research participants may change their behavior if they have detailed knowledge about a study’s aims or if they know for certain whether they are in the placebo group or the group that received the active medication. Thus, scientists of both social and medical varieties have raised questions such as: Under what conditions is it acceptable to deceive participants about the true aims of the study? Do participants have a right to be debriefed after? And how much do we need to communicate about risk to ethically carry out research? Perhaps because the Belmont principles are fairly clear about the imperative of consent, there has been much less discussion of the conditions under which consent is required. Until now, consent has been a core tenet even if, in practice, scholars violated this principle frequently. Table 7.1 provides three ways to classify experiments in terms of the type of participation they rely on and the quality and amount of information that participants are given about the study’s aims. The gold standard, implied by the Belmont Report, requires that participation is voluntary and information about the treatments involved and the potential risks are articulated as clearly as possible to the research participants.9 Gold-standard projects do not require that the risks be minimal, but instead that the risks are adequately communicated to voluntary participants.10 The silver standard emerges

8 On informed consent, see Thorne (1980) and Gray (1978); on deception, see Baumrind (1985), Bonetti (1998), and Geller (1982), as well as Chapters 27 and 28 in this volume.
9 Voluntary participation may not always be so cut and dry. As Klitzman (2015, p. 90) shows through interviews with IRB members, the fact that medical research often pays people to participate in trials can, in some contexts, impel people to participate solely for the monetary gain. This is more coercive than pure voluntary participation.
10 These categories are not consequentialist. They prize voluntary participation and full information regardless of the risk. As I argue in Teele (2014), the risk in social experiments is often not solely at the individual level and may not be understood at the outset. Thus, arguing that a project poses no risk and therefore should not be subject to oversight is not compelling. Under the veil of ignorance, which is useful for thinking about ethics under the bronze standard, researchers would almost certainly think about the risks involved.

Table 7.1 Three standards of ethics and experiments.

Gold standard
  Participation: Voluntary
  Information: Full
  Examples: Research where the purpose of the research is communicated to participants.

Silver standard
  Participation: Voluntary
  Information: Limited
  Examples: Medical experiments with placebos; some lab experiments with games; psychological experiments without knowledge of the manipulation; survey research with experiments embedded.

Bronze standard
  Participation: Involuntary
  Information: None
  Examples: Audit experiments; informational field experiments; canvassing experiments; radio or television experiments; experiments with randomization at a higher level of aggregation; social media experiments; observational research with administrative data on the deceased.

when participation in research is voluntary, but participants are not given complete information either about the study’s aims or about the nature of their placement within the study. Because participants in biomedical experiments consent to participate in research but typically do not know whether they are receiving the treatment or a placebo, the silver standard is the de facto norm in randomized controlled trials in biomedicine. The silver standard is also common in most social scientific experiments that emerge from laboratory- or survey-based research. In these settings, people participating in laboratory or survey experiments do so voluntarily, but they may not know the full aims of the research, and they may not know whether they are seeing questions that count as experimental treatments or controls. The silver standard is also common in interview-based research and in participant observation: participation is voluntary, but the researcher may not outline the full aims of the study. In each of these examples, the participants

themselves have agreed to suspend the full information condition so that they can participate in the research.11 The third row of Table 7.1 outlines the bronze standard for research ethics. Here, both voluntary participation and full information have been suspended, producing a condition in which the researcher provides virtual consent for the participant. A considerable number of experimental designs in political science are in the bronze-standard realm: audit experiments, like the bureaucrat–citizen study described in the introduction; correspondence experiments, in which many unknowing and involuntary participants receive emails prompting them to respond in some way and provide information (many of these involve public servants); information field experiments, where scholars

11 In medical ethics, allowing individuals to consent without knowing all of the potential risks and rewards ahead requires “equipoise,” a condition where researchers really do not know whether the experimental drug will be better than the standard of care. In all standard protocols, if a trial drug is shown to be lethal or accompanied by terrible side effects, then the experiment must halt. On the other hand, if the trial drug is shown to be curative, then the experiment must stop and all participants must be administered the life-saving treatment. In social experiments, there is no maxim (nor does there appear to be a budgetary requirement) for halting an experiment and administering to everyone the treatments that work.

send mailers with information about, for example, candidates for election or polling station locations and thereafter observe the effect on turnout; and canvassing experiments, like get-out-the-vote (GOTV) experiments, where scholars send teams of individuals to knock on the doors of potential voters. To be sure, in GOTV experiments, citizens who participate in answering questions to a survey will know they are a part of some study (and they will have to consent to answering questions), but in other studies that do not have a survey component and rely on individual-level public voting records, individuals can be the unit of analysis even if they do not know that information about them is being collected and stored. Virtual consent is also extended in scenarios when the unit of analysis is at a higher level of aggregation, such as a village, a constituency, a television marketing area, etc. Here, individuals do not consent to participating in a study and data that could be linked to individuals are not collected. Although they have not named it “virtual consent,” IRBs have developed ways to think about consent in situations in which the researchers, and not the subjects of experiments, provide the consent. By invoking the USA’s Federal Policy for the Protection of Human Subjects under what is known as the “Common Rule,” scholars can carry out research involving human subjects that is exempt from oversight for a few categories of research.12 Under the revised Common Rule (updated in 2018), there are a few types of research for which scholars do not need to procure informed consent: when working in established educational institutions; and when conducting research that uses testing, including educational and aptitude tests, surveys, interviews, or observation of public behavior. The caveat here is that respondents’ identities must be protected, or, if the respondents are identifiable, revelation of the respondents’ identities given their performance would

12 See part A, §46.104 for a discussion of exemptions (www.hhs.gov/ohrp/regulations-and-policy/regulations/45-cfr-46/index.html).

not harm their reputations or life chances; also exempt is research involving “benign” behavioral interventions carried out using adult subjects that utilize verbal or written responses or that involve recordings, so long as subjects’ identities cannot be ascertained and they would face no risk to their life chances upon disclosure.13 The final type of exempt investigation is secondary research when identifiable private information is publicly available. The idea behind suspending individual consent for these classes of research is likely twofold. First, it cuts down on paperwork for institutions like IRBs, which need only verify that the exemption should stand. Second, the relatively low risk stemming from what the government perceives to be “benign” behavioral interventions informs the judgment that participation will bring no harm. Note, however, that if a key component of the research involves deceiving the participant about the purpose of the research, the Common Rule requires that informed consent must be obtained prospectively. In other words, the Common Rule exempts from informed consent research that is somewhere between the gold and silver standards described above provided that the research intervention is benign. But if there is deception, subjects have a right to understand the purpose of the research and the fact that they may be deceived, and they must prospectively consent to participate. Finally, the Common Rule allows government agencies to study their own institutions to learn about efficiencies, service delivery, and the like without procuring informed consent, but all research on government agencies has

13 §46.104.3.ii: “For the purpose of this provision, benign behavioral interventions are brief in duration, harmless, painless, not physically invasive, not likely to have a significant adverse lasting impact on the subjects, and the investigator has no reason to think the subjects will find the interventions offensive or embarrassing. Provided all such criteria are met, examples of such benign behavioral interventions would include having the subjects play an online game, having them solve puzzles under various noise conditions, or having them decide how to allocate a nominal amount of received cash between themselves and someone else.”

to be registered in an online database prior to the beginning of the study. To the degree that individual-level information that is identifiable is recorded as part of a study, the US government’s policy requires informed consent. To the degree that research interventions are not “benign,” the Common Rule requires informed consent. These features of US policy are important because they suggest that while there are cases where consent is not necessary, the policies do not allow individuals, including elected leaders, to consent virtually for others. Yet, a question that is often raised, especially in the context of developing countries, is whether public officials, in their capacities as representatives, might consent to the participation of the citizens whom they represent. For many experimenters working in the Global South, this issue is key, as in some settings the questions they seek to answer are related to group outcomes instead of individual outcomes. If the randomization takes place at a higher level of aggregation (e.g., the village level), then the outcomes need to be measured at that level to produce internally consistent estimates. In this case, experimenters might hope that the village council-person, or a subnational representative, could virtually consent for their citizens. If we were to apply the Common Rule to these research settings, individual-level data must not be recorded if individual-level consent was not procured. The thought experiment that I outline below, which imagines that a researcher is contemplating whether she or he can virtually consent for research subjects, is useful for thinking about virtual consent when granted by representatives or supervisors. It bears emphasizing that very little work with human subjects likely adheres to the gold standard. Whether a researcher is conducting interviews or an online survey or running an experiment, there are justifications, like the Hawthorne Effect or threats of social desirability bias, to hold one’s hypotheses and pet theories close to one’s chest. In addition, ethnographic scholars may not actually have full information about the potential risks of interacting with the community (both for

themselves and their subjects).14 Ultimately, scholars have to decide how much information to reveal and whether to press subjects into research without their knowledge. Projects that are operating under the bronze standard are clearly the most concerning, but the fact that they are common suggests that individual researchers have wide latitude in which to make the case to IRBs that they do not need to procure informed consent. When researchers argue that they do not need to procure informed consent from every research subject, their claims often rest on the idea that the risks from participating in the research are low, and provided that they work with non-vulnerable populations, public interest in the results of the experiment supersedes the need for consent. Yet since most research agendas are not subject to any scrutiny by the public (nor do I believe they all necessarily should be), it is hard to know how public interest is determined in these cases. For example, while most political scientists may agree that public participation in elections is key to a vibrant democracy, does the public at large believe this to be true? Individual voters may not perceive turnout experiments to be in their interest, especially if turnout increases participation among voters with whom they disagree. In addition, even in experiments where political scientists agree that the end goal is more or less desirable, individual- and community-level risks may be hard to comprehend. For example, if turnout studies end up increasing participation by those already more likely to vote (Arceneaux and Nickerson 2009), they may inadvertently increase the relative voice of rich citizens vis-à-vis poor citizens. And even “benign” lab interventions where individual participants do consent may cause strife in the village and lead to accusations against researchers (Dionne et al. 2015).15

14 Bosk and De Vries (2004, p. 253). On ethical dilemmas after a researcher leaves the field, see Knott (2019).
15 Dionne et al. (2015) provide several examples of consequences (i.e., risks) that were unforeseen in their research in Africa. For example, when only some people were randomly selected to receive more medical treatment, those who were selected refused to participate; when some village members were randomly chosen to participate in lab experiments, other villagers spread rumors that the researchers were devils and delayed the research substantially. Dionne et al. have moved toward transparent and participatory open fora to discuss research plans before implementation.

The study may not pose particular risks to individual people in an easily imaginable or direct way, but the downstream consequences at the community level may be substantial (see Teele 2014). Ultimately, my claim is that an evaluation of individual-level risks does not provide a satisfying way of contemplating whether a particular research agenda is ethical. Insofar as we exist in a world where individual informed consent is routinely circumvented, we need to have a way to think about the conditions under which extending virtual consent might be ethical.

7.2 Virtual Consent behind the Veil of Ignorance

To conceptualize the ethics of virtual consent – a condition under which the researcher consents for others – we have to think abstractly about the preferences that other people in the world would hold about particular research projects. To do this, I suggest that we draw on Rawls’s thought experiment of choice behind a veil of ignorance. Rawls was interested in distributive justice: that is, articulating a moral principle for the division of resources in society. In A Theory of Justice, Rawls writes:

The idea of the original position is to set up a fair procedure so that any principles agreed to will be just. The aim is to use the notion of pure procedural justice as a basis of theory. Somehow we must nullify the effects of specific contingencies which put men at odds and tempt them to exploit social and natural circumstances to their own advantage. Now in order to do this I assume that the parties are situated behind a veil of ignorance. They do not know how the various alternatives will affect their own particular case and they are obliged to evaluate principles solely on the basis of general considerations. (Rawls 2009 [1971], p. 118)

He goes on to argue that a just distribution would be one that would be chosen by a reasonable person (or a person who is capable of having a sense of justice; p. 125) prior to that person knowing the type of house he or she would be born into or the distribution of resources in that society.16 I believe this type of reasoning can also provide insight into experimental ethics. Imagine that you are a reasonable person who is going to make a decision for a world that you do not know much about. You are provided with information about a proposal for a scientific experiment and are given only limited information about the world in which the experiment will be carried out. For example, you do not know how many people are poor and how many people are rich; you do not know if women are equal to men (or if there are genders at all); and you do not know if ascriptive characteristics like race segregate people’s life chances. You also do not know the likelihood that you will be born into any particular group. Instead, all you know is that if you approve the project there is some chance that you will be a part of the experiment and another chance that you will be the scientist. In the absence of complete information about the world or about which side of the clipboard you will get to see, you have to choose whether the experiment should be carried out. Most people in this situation would engage the question of whether they, their families, or their communities would be harmed in a tangible way if the study were implemented. Many might also go a bit further to think of how other people, with different interests and preferences, might regard the relative risks and rewards of participation. A step further would be to discuss with others whether the intervention being proposed sounds alarm bells. And finally, the step yet further (which would push the project up into the silver-standard territory) would be to engage the actual people whose lives might be altered in discussions about the project. The ethical experiment is one to whose implementation one would nevertheless consent, even while uncertain about whether one would be the research subject or the experimenter.17

16 In fact, in his early work, Rawls talks about a “rational” person. But given how rationality has historically had gendered connotations, I prefer “reasonable.” Rawls himself develops a concept of “reasonable” later in his career, which came under a lot of scrutiny by political theorists. I do not seek to intervene in that debate, but rather to suggest that the thought experiment is useful for an individual researcher seeking to reflect on whether a proposed project is ethical.

7.2.1 An Example

To illustrate the utility of this thought experiment, let’s use an example from my own work, as it is an example of a very common research design – implemented more than 50 times now (Costa 2019) – and one whose ethics I debated for a long time. In Kalla et al. (2017), my coauthors and I designed a correspondence experiment meant to help us understand whether local American public officials were more encouraging to young men interested in careers in politics than to women with similar aspirations. Inspired by a large literature in American politics that suggests that US women demonstrate lower levels of political ambition, we wondered whether this lower level of ambition might be due to different signals that young men and women received early on in their professional journeys. To answer this, we needed to understand whether men and women really do receive different signals. Our correspondence experiment utilized a large database of public officials’ email addresses, to which we sent emails from fictitious high school students either asking for help on a class project or requesting information about pursuing a career in politics. Our primary outcome measures included whether the fictitious student received any response from the real public officials, whether the response was encouraging or discouraging, and the type of language used to communicate with the fictitious student. We found that female

students were actually more likely to receive a response from local public officials than male students, and that the group least likely to receive a response was men with Latinx last names. When examining the ethics of Kalla et al. (2017), there are several issues to consider: the exploitation of vulnerable groups; risk to subjects and community; deception; and consent. While our experiment did not place undue burden on vulnerable groups to be the subjects of study and posed minimal risk to the subjects, it did involve deception and did not procure informed consent. In the online appendix, we wrote, “It is our belief that deception and a lack of informed consent are ethically problematic when experiments are carried out on vulnerable populations, when they carry risk to the participant, and when they have potential community-level or downstream consequences after the experiment is completed. An intervention of the sort described here, which asks elite leaders to help a student with a class project, and to engage in communication that is on-par with the types of things that these leaders do every day (i.e. answer emails) does not evince these concerns.”18 The logic we used in making the assessment was related to the Common Rule – that the research is ethical because it is in the public interest to understand the behavior of specific groups of subjects, like legislators, and that it meets the common rule standard of deception without the risk of harm. Suppose instead that we consider the ethics of this correspondence study behind the veil of ignorance. What would a reasonable person say about whether the experiment should be carried out? On the one hand, the person might intuit that understanding whether women are treated equally to men is an important question (whether they believe women are, or are not, discriminated against in public life), and the person might surmise that the answer is important enough to warrant taking up

17 See Frohlich and Oppenheimer (1992) for a rigorous discussion of the effectiveness and limits of the original position as a method for achieving impartial reasoning about moral questions.

18 The online appendix is available here: http://dx.doi.org/10.1086/693984.


some small amount of public officials’ time.19 There are probably no potential physical harms that could come of the experiment, and there are probably no individual downstream consequences for that person (especially if the data are recorded in a completely anonymous fashion). So in terms of the potential rewards, we could say that there are some, because we stand to learn things that may not be observable through other means, and in terms of potential risks, we could say there are not many. Thus, a reasonable person might agree that this research is ethical. However, what if, instead of thinking solely about this experiment, the person were to raise other concerns, such as: Should we be wasting politicians’ time? What will happen if the politicians find out that their (extremely thoughtful) responses were merely used as a row in some researcher’s Excel spreadsheet? Would those politicians think again about answering constituent emails in the future? If so, how does this impact the quality of representation and the efficacy of democracy? Several political scientists have discussed the issue of wasting public officials’ time, and Slough (2018, described in Chapter 27 of this volume) actually attempted to measure the total public money that would be spent on her intervention, determining that the amount was relatively small. Yet unforeseen outcomes remain a possibility, such as when a large audit study sent by a Yale law professor to all county clerks in Colorado (and in several other states) just prior to an election caused alarm within the state’s Internet technology services and required hundreds of man-hours to verify that it was not electoral interference (see Chapter 27 in this volume). In addition, a reasonable person might wonder whether, if the experiment is carried out, its publication will encourage other scholars to pursue similar research designs and to email either that same set of politicians or another similar group. The person might wonder whether this is asking too much, or at least engage the question of how many such projects could be justifiable. McClendon (2012) describes this as the potential to “spoil the pool,” and Grose (Chapter 8 in this volume) suggests that people should utilize a pre-registry to make sure that not too many experiments are carried out on the same group of elites. But even with a massive coordination effort, there is a concern about what happens to the quality of representation if either the majority of emails that politicians receive are from fake citizens or politicians begin to believe that the majority of emails they receive are for research.20 A key concern is that the proliferation of audit experiments will influence not only whether politicians are willing to engage with supposed citizens, but also the set of issues that those politicians believe are important to their citizens.21 Finally, perhaps the reasonable person would consider the issue raised by Shapiro (2014): namely, that if we agree to fund this particular experiment, we may not be able to fund other research. They may be concerned that the amount of information that we learn from this one particular study might be less valuable than a comprehensive evaluation of gender bias using other methods. As we can see, when viewed in this way, the ethics of the correspondence study that I carried out become more problematic. An innovative paper by Desposato (2018) speaks directly to the issue of how people evaluate the ethics of research. In what he calls an exercise in “empirical ethics” (p. 740), Desposato surveys both a random sample of the US population (n = 3000 from

19 Many political scientists argue for the importance of these types of experiments. See McClendon (2012) as well as Chapter 8 in this volume, who encourage us to be bolder in our experiments on elites.
20 Naurin and Ohberg (2019) surveyed Swedish politicians about their own participation in research and argue that politicians are more concerned with being burdened by lengthy surveys and by not having the results circulated than with participating.
21 This line of critique is similar to that raised in Whitfield (2019, p. 6), which suggests that all American citizens are the “principals” – the key stakeholders to whom representatives are accountable – in this type of correspondence experiment. If politicians become less likely to respond to all emails after learning that some emails might be fictitious, then this reduces representation and responsiveness for all citizens, and hence would be unethical.

Survey Sampling International (SSI)) and a convenience sample of political scientists in the American Political Science Association (n = 1600, response rate = 11.25%). The surveys present several vignette experiments that ask respondents about the ethics of specific research designs. The first vignette focuses on the ethics of informational studies (like mailer experiments), while the second focuses on correspondence experiments (like Kalla et al. 2017). Each vignette randomly varied factors like how large the experimental sample would be, the exact nature of the mailer or informational request, and, most importantly, whether the subjects would have an opportunity to consent to be in the study. Desposato’s figure 1 shows that, across the board, the American public and scholars are less willing to support research that does not procure informed consent, and that consent is deemed important even in research that has potentially greater downstream consequences, such as an experiment where citizens are informed of a politician’s previous DUI. Although scholars and the public had similar rates of acceptability for controversial designs under the condition of consent, scholars penalized research without consent more than the public (figure 2). Without more evidence, we cannot say whether the penalties that are put on research without informed consent come from nonexperimentalists’ desire to gatekeep experimental research or from scholars’ concerns about individual rights. Even under the condition of consent, and even with a trivial public health mailer experiment (where people are reminded to floss), about 15% of public respondents said that they would not want to participate in the research. Whether this is because they would rather not have another mailer sent their way or because of an ethical concern is unclear. But it is telling that the rate of desired nonparticipation increases in the controversial condition with informed consent, where more than 30% said that they would rather not participate. When not given the opportunity to consent, far more say that they do not want to participate (30% in the trivial treatment and 48% in

the controversial treatment).22 What this exercise suggests is not only that the perceived acceptability of research depends on whether consent is procured, but also that people may have preferences about participating in research even when given the opportunity to consent. These types of preferences would likely influence the conclusions that reasonable people would reach behind the veil of ignorance and render many research designs unethical.
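
The factorial logic behind vignette experiments of this kind can be sketched in a few lines of code. The factors and levels below are hypothetical stand-ins rather than Desposato’s actual instrument; the only point is that each respondent is independently assigned one level of every factor, including whether the described study obtains consent.

# Illustrative random assignment for a factorial vignette survey experiment.
# The factors and levels are hypothetical stand-ins, not the instrument
# actually fielded by Desposato (2018).
import random

FACTORS = {
    "sample_size": ["500 subjects", "5,000 subjects"],
    "intervention": ["reminder-to-floss mailer", "mailer about a politician's DUI"],
    "consent": ["subjects may opt in", "no opportunity to consent"],
}

def assign_vignette(rng):
    """Independently draw one level of each factor for a single respondent."""
    return {factor: rng.choice(levels) for factor, levels in FACTORS.items()}

rng = random.Random(7)  # fixed seed so the assignment is reproducible
for respondent_id in range(1, 6):
    print(respondent_id, assign_vignette(rng))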

7.3 The Generalized versus the Concrete Other

Over the years, there have been many criticisms of Rawls’s thought experiment, and though I do not have space to articulate them all here, one that has particularly resonated with many feminists is the idea that the rational individual that Rawls and his followers draw on comes with an unreflexively masculine and abstract point of view. That is, moral philosophers have often considered ethical principles from their own perspectives, assuming that they can stand in for an abstract individual. Feminist philosophers (such as Benhabib 1992, ch. 5) argue that the concept of the generalized other or the rational person in Rawls is disembodied, and that it does not require theorists to come face-to-face with the issues faced by “concrete others.” That is, because it robs its users of knowledge of the world, the thought experiment does not force the thinkers to interact with categories of people, or specific people, who may reach different conclusions about the ethics of particular resource distributions or of research designs (Benhabib 1992, p. 167).

22 In 16 studies, Meyer et al. (2019) document what they call the A/B illusion: “people appeared to judge a randomized experiment comparing two unobjectionable policies or treatments (A and B), neither of which was known to be superior, as less appropriate than simply implementing either A or B for everyone.” Consent was usually lacking in both the experimental treatments and in the conditions when the policy was implemented universally.

What I take this to imply, for our purposes, is that we have to consider what people with different values from our own or different life experiences might choose behind the veil of ignorance. Although it is not possible to consider what every single person or group of people would choose, we might start by thinking about how someone who did not come from money, someone who has very limited time, or someone who is socially marginalized might think of the project. The results in Desposato (2018) can also provide insight into this issue, as he finds concrete evidence that political scientists who have previously carried out experiments found the experimental designs much more acceptable than the other scholars. Moreover, scholars of American politics found more of the designs acceptable and theorists less acceptable than the referent group of international relations scholars. Finally, women and older scholars found them less acceptable (p. 745). There is evidence, in other words, that people who are either more marginalized in the field of political science (like women) or people who are more concerned with normative issues in politics (like theorists) are more skeptical about the ethics of experiments, and in particular with the suspension of consent. Even if the nonexperimentalist respondents to Desposato’s study were particularly hostile to experiments, since half of the scholars that responded to his survey utilized experiments in their own work and experimental research does not, even now, make up half of all political science research, it is hard to say that the sample of respondents was overwhelmingly biased against experiments. In making up our own minds about the ethics of particular research, we need not weight all other ideas equally. The fact that different people can come to distinctive conclusions about the ethics of particular research agendas suggests that questions about ethics should be forefront in our discipline. The people and communities whose livelihoods, leadership, and family lives may be influenced by social scientific research deserve a chance to deliberate about the contours of our research.

7.4 Counterarguments

These ideas are perhaps going to be unpopular. Why should we limit the range of our scholarship simply because some people might think our projects are unethical? If, on average, people are willing to participate in research even in the absence of informed consent, then surely this gives us the green light. Here are a few potential counterarguments to the one I have made above.

7.4.1 First-Mover Rewards

In a policy world that is influenced by scientific discovery, where, for example, World Bank projects and Gates Foundation grants are allocated to ideas that are found to “work,” the types of experiments that political scientists are carrying out are going to be done, so why not let our discipline reap the first-mover rewards? Supposing that it is genuinely true that the research would be carried out with or without a political scientist, is this a good enough reason to become involved in the study? In the example of Three Identical Strangers, the Columbia University psychologists discovered that the adoption agency was already splitting up twins and siblings. They claimed to only be piggybacking their study on top of the standard protocol (Hoffman and Oppenheim 2019a, 2019b). Not only the adoptees but also the adoptive parents felt betrayed by this system. If they had known their child had siblings, many would have adopted the others in the family as well. Thus, just because some program is already being implemented by, say, a local nongovernmental organization (NGO), this does not mean that political scientists should get involved. We should not be complicit in a race to the unscrupulous bottom.

7.4.2 The Unethics of Not Knowing

Some might counter that there are instances in which the research question is of such import that not carrying it out is itself a form of damage to society writ large. Correct measurement of the world’s atrocities may be a moral good in itself. Yet let’s not

simplify the importance of measurement as opposed to other concerns. Recently, I heard of a complaint raised among development program officers about the ethics of experimental techniques for eliciting answers to sensitive topics. In research on intimate partner violence, the use of list experiments to recover accurate information about violence in the home may preclude social service providers from knowing which women were at the greatest risk. Hence, experimentally elicited measurement of a phenomenon at the population level may undercut interventions to help the most vulnerable women.

7.4.3 No Big Deal

Some might counter that the questions are small enough, the findings trivial enough, and the downstream consequences insignificant enough that none of this matters. But in this case I say: Why on Earth are we wasting precious research dollars on such “cuteonomics”? There are so many deserving projects; surely something that is small, even if well identified, does not need to be researched, nor does it deserve research funding. In clearly important areas like the reduction of poverty, there may also be examples of how money that would go to research could be better spent on actually reducing poverty. If it is true that experiments are the best way to learn about the world (Green and Gerber 2014), then surely in order to be able to argue that the key beneficiaries of the research are not the academic researcher, but the people whose lives are most afflicted, project budgets need to include money not only for the execution of the experiment, but also for implementing whatever programs are deemed to work best for reducing hardship in others’ lives. In medical experiments, if a treatment is discovered to far outperform the placebo or older standard of care, the trial must stop and all participants must be given the new medicine. If experimentalists believe that money is wasted on nonexperimental work, then they must be willing to pony up money to alleviate social problems when they discover what works from their experimental protocols.

7.4.4 A Professional Ethics Plea

Finally, there are arguments circulating that we should not do things that will look bad for our profession (Humphreys 2015). While this is undoubtedly true, it is insufficient as a guide to ethical research. If all we care about are professional ethics, there are plenty of people who might think that the rewards from publications will be worth risking nondetection. Since people who are poor and marginalized often have few resources to advocate for themselves, and their governments may not be very involved in protecting their rights, a concern with merely professional ethics increases the likelihood that scholars will exploit marginalized groups.

7.5 Conclusion

I have argued that we need a new ethical framework for considering whether and when it is acceptable to carry out social scientific experiments on subject populations that do not consent. Although there are university boards dedicated to passing rules and overseeing decisions about what research can be carried out under their purview, we cannot rely on IRBs – on the mandatory training they provide, the forms they require us to complete, or the judgments they pass – to help us understand how to behave ethically toward the humans that we recruit for our research. Instead, I suggest that one approach is to adapt a Rawlsian thought experiment where, by reasoning from behind a veil of ignorance, we can contemplate the ethics of a study. Virtual consent – a situation in which the researcher consents for the subject population – could only be ethical if, behind a veil of ignorance, reasonable people would agree that the research should be conducted. Importantly, the reasonable people cannot solely be stand-ins for the researcher, but have to be conceived of in terms of people with multiple interests, preferences, and positionalities. Ultimately, I am skeptical about whether it is possible to ethically extend virtual consent for most forms of experimental research.

Returning to the introduction’s example of the orphaned triplet experiment from Three Identical Strangers, what would happen if, behind the veil of ignorance, a reasonable person did not know whether they would be one of the siblings separated from their kin for the entirety of their lives or whether they would be the research scientists whose careers would be made famous by the findings? To be sure, many people are compelled to debate the relative weight of nature versus nurture in understanding human behavior, and some people have very high tolerance for risk that might make the mere possibility of attaining the researchers’ fame and fortune seem worth it, but it is hard to imagine that most people would consent to the experiment. Faced with a life lived among (even very loving) adoptive families filled with no biological relatives even when those relatives do in fact exist, most people would choose not to participate and would probably not choose participation virtually for others. This, I believe, is a good indication that the study is not ethical. In a series of articles in the Journal of the American Medical Association, the scientists that were involved in separating the twins and triplets in Three Identical Strangers came to their own defense. Their argument: the separation of twins and siblings was a common practice, and they merely took advantage of the opportunity that was presented to them. This justification – that the policy was already in place and scientists merely studied the results – cannot serve as a basis for either a professional or an individual ethic. It is not enough to claim that the implementing agencies (like an NGO or an international NGO) are ultimately responsible for the ethics of projects (Humphreys 2012). As scientific leaders, research scholars have both a professional and a moral obligation to leave opportunities on the table if they emerge on dubious grounds. Imagining ourselves to be ordinary people, interacting with people who are directly involved in our research, and thinking through the concerns of others can go a long way in helping us to determine which research can be ethically pursued. From this vantage, I think it is fair to say that accurate

measurement of any social phenomenon, including causal effects, probably would not, in itself, play an outsized role in determining the ethics of research behind the veil of ignorance. While scientific visionaries may place an inordinate amount of weight on measuring outcomes precisely, there are many circumstances in which accurate measurement would require practices that are difficult to justify and to which most people would not agree. In addition, arguments about the value added of particular social science research are likely to overstate the importance of any research agenda. Thus, I maintain the argument that I made earlier: that there is not an ethical trade-off between respect for persons and measurement (Teele 2014). I suspected then (and continue to surmise) that the desire to conceive of measurement as an ethics issue is related to a Panglossian view of the import of one’s own research for society writ large, as well as professional imperatives to publish. These two forces can create incentives that diminish the dignity of other people and undermine our integrity as scholars. Reasoning from behind the veil of ignorance may help to combat some of these perverse incentives.

References

Arceneaux, Kevin, and David W. Nickerson. 2009. “Who Is Mobilized to Vote? A Re-analysis of 11 Field Experiments.” American Journal of Political Science 53(1): 1–16.
Baumrind, Diana. 1985. “Research Using Intentional Deception: Ethical Issues Revisited.” American Psychologist 40(2): 165.
Benhabib, Seyla. 1992. Situating the Self: Gender, Community, and Postmodernism in Contemporary Ethics. Hove: Psychology Press.
Bonetti, Shane. 1998. “Experimental Economics and Deception.” Journal of Economic Psychology 19(3): 377–395.
Bosk, Charles L., and Raymond G. De Vries. 2004. “Bureaucracies of Mass Deception: Institutional Review Boards and the Ethics of Ethnographic Research.” Annals of the American Academy of Political and Social Science 595: 249–263.
Connors, Elizabeth C., Yanna Krupnikov, and John Barry Ryan. 2019. “Does Transparency Bias Survey Research?” Public Opinion Quarterly 83: 185–209.
Costa, Mia. 2017. “How Responsive Are Political Elites? A Meta-Analysis of Experiments on Public Officials.” Journal of Experimental Political Science 4(3): 241–254.
Desposato, Scott, ed. 2015. Ethics and Experiments: Problems and Solutions for Social Scientists and Policy Professionals. Abingdon: Routledge.
Desposato, Scott. 2018. “Subjects and Scholars’ Views on the Ethics of Political Science Field Experiments.” Perspectives on Politics 16(3): 739–750.
Dionne, Kim Yi, Augustine Harawa, and Hastings Honde. 2015. “The Ethics of Exclusion When Experimenting in Impoverished Settings.” In Ethics and Experiments: Problems and Solutions for Social Scientists and Policy Professionals, ed. Scott Desposato. Abingdon: Routledge, pp. 39–55.
Geller, Daniel M. 1982. “Alternatives to Deception: Why, What, and How?” In The Ethics of Social Research: Surveys and Experiments, ed. Joan E. Sieber. New York: Springer, pp. 39–55.
Gray, Bradford H. 1978. “Complexities of Informed Consent.” Annals of the American Academy of Political and Social Science 437: 37–48.
Hoffman, Leon, and Lois Oppenheim. 2019a. “Historical Practice of Separating Twins at Birth – Reply.” JAMA 322(18): 1827–1828.
Hoffman, Leon, and Lois Oppenheim. 2019b. “Three Identical Strangers and the Twinning Reaction – Clarifying History and Lessons for Today from Peter Neubauer’s Twins Study.” JAMA 322(1): 10–12.
Humphreys, Macartan. 2015. “Reflections on the Ethics of Social Experimentation.” Journal of Globalization and Development 6(1): 87–112.
Humphreys, Macartan, and Jeremy M. Weinstein. 2009. “Field Experiments and the Political Economy of Development.” Annual Review of Political Science 12(1): 367–378.
Jones, James. 1981. Bad Blood: The Tuskegee Syphilis Experiment. New York: Free Press.
Kalla, Joshua, Frances Rosenbluth, and Dawn Teele. 2018. “Are You My Mentor? A Field Experiment on Gender, Ethnicity, and Political Self-Starters.” Journal of Politics 80(1): 337–341.
King, Gary, and Melissa Sands. 2015. “How Human Subjects Research Rules Mislead You and Your University, and What to Do About It.” URL: https://gking.harvard.edu/files/gking/files/irb_politics_paper_1.pdf
Klitzman, Robert. 2015. The Ethics Police? The Struggle to Make Human Research Safe. Oxford: Oxford University Press.
Knott, Eleanor. 2019. “Beyond the Field: Ethics after Fieldwork in Politically Dynamic Contexts.” Perspectives on Politics 17(1): 140–153.
Levitt, Steven, and John List. 2009. “Field Experiments in Economics: The Past, the Present, and the Future.” European Economic Review 53(1): 1–18.
Meyer, Michelle N., Patrick R. Heck, Geoffrey S. Holtzman, Stephen M. Anderson, William Cai, Duncan J. Watts, and Christopher F. Chabris. 2019. “Objecting to Experiments that Compare Two Unobjectionable Policies or Treatments.” Proceedings of the National Academy of Sciences of the United States of America 116(22): 10723–10728.
Milkman, Katherine L., Modupe Akinola, and Dolly Chugh. 2015. “What Happens Before? A Field Experiment Exploring How Pay and Representation Differentially Shape Bias on the Pathway into Organizations.” Journal of Applied Psychology 100(6): 1678–1712.
Naurin, Elin, and Patrik Ohberg. 2019. “Ethics in Elite Experiments: A Perspective of Officials and Voters.” British Journal of Political Science. doi:10.1017/S0007123418000583.
Neubauer, Peter B., and Alexander Neubauer. 1996. Nature’s Thumbprint: The New Genetics of Personality. New York: Columbia University Press.
Rawls, John. 2009 [1971]. A Theory of Justice. Cambridge, MA: Harvard University Press.
Shapiro, Ian. 2014. “Methods Are Like People: If You Focus Only on What They Can’t Do, You Will Always Be Disappointed.” In Field Experiments and Their Critics, ed. Dawn Teele. New Haven, CT: Yale University Press, pp. 228–242.
Slough, Tara. 2018. “Bureaucrats Driving Inequality in Access: Experimental Evidence from Colombia.” URL: http://taraslough.com/assets/pdf/JMP.pdf
Teele, Dawn. 2014. “Reflections on the Ethics of Field Experiments.” In Field Experiments and Their Critics, ed. Dawn Teele. New Haven, CT: Yale University Press, pp. 115–140.
Thorne, Barrie. 1980. “‘You Still Takin’ Notes?’ Fieldwork and Problems of Informed Consent.” Social Problems 27(3): 284–297.
Whitfield, Gregory. 2019. “Toward a Separate Ethics of Political Field Experiments.” Political Research Quarterly 72(3): 527–538.
Wickenden, Dorothy. 2019. “The Politics Behind the Anti-Vaccine Movement.” The New Yorker, August 29.

Part II

EXPERIMENTAL DATA

CHAPTER 8

Experiments, Political Elites, and Political Institutions∗

Christian R. Grose

Abstract

The use of experiments to study the behavior of political elites in institutions has a long history and is once again becoming an active field of research. I review that history, noting that government officials within political institutions frequently use random assignment to test for policy effects and to encourage compliance. Scholars of political institutions have generally been slower than practitioners to embrace the use of experiments, though there has been remarkable growth in experimentation by scholars to study political elites. I summarize the domains in which scholars have most commonly used experiments, commenting on how researchers have seized opportunities to leverage random assignment. I highlight design challenges including limited sample sizes, answering theoretically driven questions while partnering with public officials or others, and the difficulty of conducting replications. I then implore scholars to be bold in using experiments to study political institutions while also being mindful of ethical considerations.

The experimental revolution has remade political science and the social sciences, and increasingly scholars are conducting

experiments with political elites and using experiments to study political institutions. While the use of experiments has been

* I thank Dan Butler, James N. Druckman, Donald P. Green, Indridi Indridason, Joshua Kalla, Diana Mutz, Ariel White, Lynn Vavreck, and participants at the Northwestern experiments conference in spring 2019 for their excellent comments on this chapter specifically or for additional comments that were offered generally at the conference that helped me refine and revise this chapter. This chapter is dedicated to the memory of Dick Fenno, who was an incredible mentor and scholar who left us in 2020. He taught me the value of scholars carefully and ethically intervening with public officials to learn about political representation and institutions and how to interact and engage with policy practitioners. Rest in peace, Dick Fenno.

more dominant in scholarship on political behavior, there has been expansive growth in the use of experiments and methods of causal inference when studying political institutions. The study of political institutions has been enhanced by the use of field, survey, and laboratory experiments. I argue that the use of randomization and experiments within political institutions was common even before most political scientists used the method to causally estimate empirical effects. The use of randomization and experiments to study political institutions allows for an understanding of causal relationships in a subfield previously thought of as less amenable to experimentation. I then summarize the areas of research in which scholars have most commonly used experiments to study political institutions, and I note particular contributions in the study of legislatures and political representation. I discuss the benefits and challenges of field experiments, survey experiments, and laboratory experiments to study political institutions. In particular, I discuss three challenges in using experiments to study political elites: (1) limited sample sizes and statistical power; (2) answering theoretically driven questions that are of interest to scholars while partnering with public officials or others in field experiments; and (3) the difficulty of conducting replications of some field experiments in places in which there are few similar institutions. Finally, I encourage scholars studying political institutions using experimental methods to be bold, but to be careful and ethical, when designing such experiments. Given the potential for sample pool depletion and small sample sizes, scholars should move boldly forward in attempting to incorporate experiments for studying institutions where feasible, but the research should be carefully designed with consideration of ethics. Box 8.1 displays the key takeaway points of this chapter for those interested in the use of experiments to study political institutions. I conclude by arguing that experiments are one important method for studying political institutions.

Box 8.1: A guide for scholars using experiments to study political institutions.

(1) Be mindful of lower statistical power in experimental studies of institutions (see the illustrative power sketch following this box).
(2) Partner with external nonprofit or political groups to enhance experiments on institutions.
(3) Replicate studies. When precise replications are not feasible, pair field experimental evidence with laboratory or survey experimental evidence or pair experimental evidence with observational and descriptive evidence.
(4) Be bold, but be careful and ethical; conduct experiments on theoretically meaningful questions. Given the paucity of political institutions where experiments are feasible, choose your questions wisely.
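
To make point (1) concrete, here is a minimal back-of-the-envelope power calculation for a hypothetical audit-style experiment on elected officials. All of the numbers (a 55% baseline response rate, a 5-percentage-point treatment effect, and the per-arm sample sizes) are illustrative assumptions rather than figures from any study discussed in this chapter.

# A rough, illustrative power calculation for an elite field experiment.
# All quantities are hypothetical assumptions, not estimates from the literature.
from math import sqrt
from statistics import NormalDist

def approx_power(n_per_arm, p_control, p_treat, alpha=0.05):
    """Normal-approximation power for a two-sided difference-in-proportions test."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p_control * (1 - p_control) / n_per_arm +
              p_treat * (1 - p_treat) / n_per_arm)
    return 1 - NormalDist().cdf(z_crit - abs(p_treat - p_control) / se)

# Hypothetical audit-style design: a 55% baseline response rate and a
# 5-percentage-point treatment effect on a binary outcome (e.g., "did the
# official reply?").
for n_per_arm in (250, 1000, 4000):
    print(f"n per arm = {n_per_arm}: power is roughly {approx_power(n_per_arm, 0.55, 0.60):.2f}")

Under these assumptions, randomizing an entire chamber of roughly 500 officials (about 250 per arm) yields power of only about 0.2 for a five-point effect, which is why small subject pools and statistical power are treated here as a first-order design constraint.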

8.1 Randomization and Political Institutions: Together for Longer than You Realize

Political institutions are defined as the formal rules, procedures, and organizations – such as legislatures, courts, executive institutions, commissions, and election rules – in which individual or collective behavior occurs or is constrained. Often, though not always, scholars of political institutions examine the decisions and behaviors of political elites within these institutions. The study of political institutions and experiments have been linked to one another for much longer than most scholars of political science realize. In the discipline of political science, the rise of survey and field experiments is attributed to scholars of individual political behavior (Gerber and Green 2014; Iyengar and Kinder 1987; Robison et al. 2018), which typically examines individual actions and opinions that voters or citizens take. For instance, Gerber and


Green (2000) is often considered the first contemporary field experimental article examining the effect of get-out-the-vote (GOTV) contacts on voter mobilization (Larimer 2018; also see Eldersveld 1956; Gosnell 1927). Chong and Druckman (2007) and others have used survey experiments to understand how framing affects the attitudes of regular people. Following these articles, the amount of behavior research utilizing experiments has increased (Robison et al. 2018). During this surge in the use of experiments to study the behavior and attitudes of the mass public, most scholars of political institutions, in contrast, did not initially adopt experimental methods. Theorists used formal models to specify arguments about political institutions, often leading to comparative static predictions about changes in elite behavior under different institutional arrangements. Empiricists used observational data to answer questions and test theoretical predictions about political institutions, representation, and elites (though, as I note later in this chapter, a small number of public choice and institutions scholars conducted laboratory experiments). While scholars of political institutions were not as quick to adopt this method nor to embrace causality, practitioners within political institutions used experiments and randomization well before the experimental revolution in political behavior commenced in the 2000s. As Grose and Wood (forthcoming) argue, political institutions themselves have utilized randomization more and earlier than have most scholars of political elites. There are numerous instances of randomization by public officials to distribute limited resources, to test for policy effects, or to carry out random government audits where there are not funds for full audits. For instance, one of the first known instances of a field experiment on members of the US Congress occurred in the 1970s (Eber 2005; Wood and Grose 2019, forthcoming). This experiment was not conducted by a political scientist, but instead was carried out by an auditor at the newly created Federal Election Commission (FEC). As a


new method of enforcing campaign finance law, the FEC randomly assigned about 10% of Members of Congress to be audited for compliance with the law. Wood and Grose (forthcoming), examining this natural experiment, find that the random assignment of legislators into audits caused legislators to lose more general election votes than their control group colleagues, yet they find that audited legislators went home to their districts no more frequently than those in the control group. Other instances of randomization conducted by government institutions include the Vietnam draft lottery (Erikson and Stoker 2011); allocation of US congressional offices to new members (Rogowski and Sinclair 2017); allocations of seating of legislators in the Icelandic parliament (Darmofal et al. 2019; Jo and Lowe 2019); random audits of lobbying disclosures by the US Government Accountability Office (Wood and Grose 2019); random draws for which Members of Parliament can submit private bills (Williams and Indridason 2018); lotteries to make committee assignments to legislators (Broockman and Butler 2015; Cirone and Coppenolle 2019; Titiunik 2016; Titiunik and Feher 2018); and the US state of Georgia randomly distributing land shamefully taken from indigenous people to white men (Hall et al. 2019; Weiman 1991). At many levels of courts, there is often random assignment of judges to cases (Hall 2009, 2010; Kastellec 2011; Levy 2017; Sunstein et al. 2006). Scholars have only recently leveraged these randomized experiments where the randomization has been conducted by political institutions. Of course, these are not the only instances where public officials have been subject to government randomization, nor where political elites or public officials used random experiments to evaluate policy or for other reasons. However, I note these examples as public officials have been comfortable with the use of randomization and natural experiments even when scholars of political elites have been slower to adopt this method. For decades, many agencies have conducted randomization as part of their institutional


procedures or written randomization and policy evaluation into their laws. Also, this randomization by and of public officials in government even precedes the rise of the widespread contemporary use of field and survey experiments to study the mass public (Chong and Druckman 2007; De Rooij et al. 2009) and the adoption of field and survey experimental techniques to study political elites and institutions (Butler 2019; Grose 2014) by social science scholars in the academy. Given this history of government agencies, political institutions, and political elites using experimentation in the field, it is not surprising that political scientists have also eventually turned to this method to study political institutions and political elites. The scholarly perception that experiments are dominant only in the political behavior subfield, and have little place in the study of political institutions, is wrong. This history also suggests that there is room not only for theoretical and observational work on institutions, but also for continued growth in the use of experiments to study political institutions and political elites. Indeed, I anticipate more multi-method approaches that will utilize experiments while also bringing other nonexperimental data to the research question (e.g., Kriner and Schickler 2014, 2017; Wood and Grose forthcoming). Some of the best work on the study of political institutions – including studies of the US Congress, legislatures, the executive branch, the bureaucracy, public administration, and the courts – pairs experimental methods with descriptive data, as not all institutions are amenable to field, survey, or laboratory randomization.
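
To make concrete how scholars can leverage randomization that institutions carry out themselves, the sketch below mimics the basic analysis of a design like the FEC audit lottery described earlier in this section. The data are simulated (they are not the actual FEC or electoral records), the 10% audit rate and the assumed effect size are illustrative assumptions only, and the estimator is simply a difference in means with a conventional standard error.

```python
# Illustrative only: simulated stand-in for an institution-run audit lottery
# in which roughly 10% of legislators are randomly assigned to be audited.
import numpy as np

rng = np.random.default_rng(8)
n_members = 435                                   # hypothetical chamber size
audit_ids = rng.choice(n_members, size=int(0.10 * n_members), replace=False)
audited = np.zeros(n_members, dtype=bool)
audited[audit_ids] = True                         # complete random assignment

# Hypothetical outcome: general election vote share, with an assumed
# two-point penalty for audited legislators (chosen for illustration only).
vote_share = rng.normal(60, 8, n_members) - 2 * audited

ate_hat = vote_share[audited].mean() - vote_share[~audited].mean()
se_hat = np.sqrt(vote_share[audited].var(ddof=1) / audited.sum()
                 + vote_share[~audited].var(ddof=1) / (~audited).sum())
print(f"Estimated effect of audit on vote share: {ate_hat:.2f} (SE {se_hat:.2f})")
```

Because the institution, rather than the researcher, performed the randomization, the scholar's task reduces to recovering the assignment and comparing outcomes across the randomly formed groups; what matters most in practice is documenting how the lottery was actually conducted and verifying that assignment was indeed random.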

8.2 The Use of Experiments by Scholars to Study Political Institutions and Elites The scholarly study of political institutions has traditionally been within the rational choice approach, and the earliest laboratory experiments on political institutions used this approach (Fiorina and Plott 1978; Riker 1967; Riker and Zavoina 1970). Excepting this early work utilizing laboratory

experiments, until recently, much empirical scholarship of political institutions has typically relied upon observational data for analysis. This tradition has emphasized that theoretical implications are derived from precise models of institutions and individual behavior. Critiques of this tradition have often come from scholars who advocate for experimental work. As a result, many institutions scholars have mistakenly viewed experiments as not being part of the traditions of the subfield, even though the disciplinary leaders of rational choice and political institutions utilized experimental methods to test their predictions (e.g., Riker 1967 is, to my knowledge, one of the earliest experiments published in the American Political Science Review, though preceded by Eldersveld 1956 and Laponce 1966). A group of scholars of political institutions and elites has emerged to show that experiments are compatible with the study of political institutions and elites. Experiments can be designed that test the theoretical predictions of models of political institutions. Many of the first experiments to study political institutions were conducted in the laboratory. In these experiments, laboratory participants – often convenience samples of students – make choices under randomly assigned institutional arrangements (e.g., Eckel et al. 2010; Enemark et al. 2014; Kanthak and Woon 2015; Morton and Williams 2010; Palfrey 2009). These experiments have internal validity and often test predictions of formal models of strategic behavior in institutions. Other work has used actual treatments designed by legislators and other political elites and randomly assigned them to assess public opinion (e.g., Cover and Brumberg 1982). Around the late 2000s, a small body of work began to emerge in the area of legislative representation and decisionmaking in the USA that utilized field experiments. The first field experimental study in a legislature was Bergan (2009), who conducted a field experiment on grassroots lobbying. The experiment was conducted in 2006 in New Hampshire, and a treatment group of legislators was randomly assigned


to receive contact from regular citizens on policies before the floor of the lower house. The article reports mostly null effects of the randomized citizen contacts to legislators, though in one instance there is a sizable and statistically meaningful effect of citizen contact on roll-call voting by legislators. The first field experiment where members of the US Congress were the subjects was conducted in 2007 with US senators (Grose et al. 2015). This article asks how and why legislators explain their roll-call positions and argues that legislators are likely to offer compensating and reinforcing information to constituents in order to explain their votes (Fenno 1978; Mayhew 1974). These explanations are particularly important when the legislator does not vote in the direction that the constituent wished. An experiment was conducted on US senators where half were initially randomly assigned to receive a position from a constituent favoring immigration reform and the other half were assigned to receive a constituent position against immigration reform. This issue had been subject to cloture votes in this session, providing a public roll-call position for senators that could be compared to their explanations offered to constituents. Actual constituent confederates were used to avoid deception, and a within-subjects design was employed so that all senators received both the pro and the con letters from constituents in two waves. Results showed that Members of Congress were honest and consistent in reporting how they voted to constituent supporters or opponents, but that they explained their votes quite differently when constituents favored immigration reforms in contrast to those constituents who expressed anti-immigration opinions. Additional results showed that these legislator explanations were effective at convincing regular citizens to vote for and support the legislator. Other early and important field experiments on US state legislatures were conducted in 2008 by Butler and Nickerson (2011) and Butler and Broockman (2011). In Butler and Nickerson (2011), state legislators in one state were randomly assigned to learn constituency opinion about an issue.


Those legislators randomly assigned to this information treatment were more likely to follow constituency opinion when voting. Butler and Broockman (2011) is the first audit study of legislators where the race of constituents was randomly varied. Results suggest that US state legislators are more responsive to a randomly assigned White constituent than to a randomly assigned African American constituent. Legislators were biased against African American constituents, even though this bias should be reduced or removed due to legislators’ electoral incentives. One of the cleverest field experiments with US House members is Kalla and Broockman (2016). Conducted in 2013 in partnership with an advocacy organization, the study examined whether donors receive more access to elected officials than regular citizens do. The advocacy group randomly assigned whether a donor or nondonor contacted the elected officials. Elected officials were much more likely to meet with donor constituents than with non-donor constituents, causally demonstrating that money can buy access to elected officials. After these early studies, the use of experiments grew in the study of political representation and elites. Audit studies randomly assigned race, ethnicity, gender, immigration status, religion, and other factors to assess whether legislators were differentially responsive to constituents (e.g., Broockman 2014; Bussell 2019; Butler 2014; Gell-Redman et al. 2018; Kalla et al. 2018; Lajevardi 2018; Mendez and Grose 2018). These studies generally showed that legislators in the USA were much less likely to respond to racial and ethnic minorities than to Whites (Costa 2017), and to constituents who lacked immigration status (Dahl 2019; Mendez 2015). Audit studies also revealed important gender differences, with women legislators responding more (Thomsen and Sanders 2019), even when there were no responsiveness differences by gender of constituents. Most of these audit studies were initially conducted in the USA or Europe, but they have expanded to a number of global contexts with political elites. For


instance, in India, legislators have been found not to discriminate on religion and to offer few but meaningful responses to constituents (Vaishnav et al. 2019). Bureaucrats and other US public officials also have been found to be less responsive to people of color than to Whites (Coppock 2019; White et al. 2015); slightly less responsive to out-partisans (Porter and Rogowski 2018); surprisingly responsive to LGB people (Lowande and Proctor 2019); and equally responsive to high- and low-social status individuals (Lagunes and Pocasangre 2019). In contrast to what one might hope, these studies suggest that legislators are not equally responsive to constituents, despite electoral incentives to be so. For more on audit studies, see Chapter 3 in this volume; also see Chapter 27 on experiments on street-level bureaucrats. Subsequent to these early legislator field experiments, research utilizing survey experiments, in which national, state, and local public officials are asked about their attitudes toward policy and politics, is also now more common. Naurin and Ohberg (2019) find that public officials in Sweden are amenable to being subjects of experiments, but are not particularly enthusiastic about being subjects of study. In general, many elected officials, especially in national legislatures, are reluctant to sit for survey experiments or to partner with scholars. For this reason, survey experiments with subnational elected officials and with subnational bureaucrats are promising (e.g., Anderson et al. 2016; Avellaneda 2013; Druckman and Valdes 2019; Grose and Peterson 2020; Nielsen and Moynihan 2017; Sheffer et al. 2018). When scholars seek to examine how institutions shape or constrain the attitudes of elected officials, survey experiments of local and subnational public officials are particularly useful. Officials’ attitudes can be measured pre- and post-treatment by scholars, and randomized treatments can present institutional and policy scenarios that officials face when governing. For instance, these above-referenced survey experiments have found that elected officials’ attitudes can change when their institutional rules

or private preferences are considered (e.g., Anderson et al. 2016; Druckman and Valdes 2019). Other scholars have found that framing leads elected officials to change their attitudes (Sheffer et al. 2018). This framing effect can even cause elected officials to change their beliefs on a pending policy decision, such as when a racial issue is reframed as an economic issue (Grose and Peterson 2020).
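
Before turning to the challenges below, it may help to make the most common of these designs concrete. In their simplest form, the audit studies discussed in this section reduce to a comparison of response rates across randomly assigned constituent profiles. The sketch below is illustrative only: the counts are invented rather than taken from any cited study, and a standard two-proportion z-test is just one reasonable way to summarize such data.

```python
# Illustrative audit-study analysis: difference in legislator response rates
# across two randomly assigned constituent profiles. Counts are invented.
from statsmodels.stats.proportion import proportions_ztest

responses = [310, 255]   # emails that received a reply, by assigned profile
contacts = [500, 500]    # emails sent, by assigned profile

rate_gap = responses[0] / contacts[0] - responses[1] / contacts[1]
z_stat, p_value = proportions_ztest(responses, contacts)
print(f"Response-rate gap: {rate_gap:.3f} (z = {z_stat:.2f}, p = {p_value:.4f})")
```

Because the constituent profile is randomly assigned, the gap in response rates can be read as the average effect of the manipulated constituent characteristic on the probability of receiving a reply.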

8.3 Challenges in Conducting Experiments with Political Elites There are several unique challenges when conducting experiments on political elites or institutions. I describe three of them below. 8.3.1 Pulling Off Field or Survey Experiments by Partnering with Outside Groups or Public Officials Conducting field or survey experimental research in coordination with political elites or with external nonprofit or political groups is challenging, yet exciting. For most scholars interested in conducting field or survey experiments, it is a good idea to work with elected officials or outside political groups. By embedding experiments within the work of outside organizations – or working with a public official or agency who delivers a treatment – the scholarship has high external validity. It is also ethically ideal to embed experiments within existing political and governmental institutions, instead of having scholars lead experiments that are not realistic and that may interfere with the day-to-day work of those in institutions. It is generally preferable to conduct experiments with political groups and practitioners when studying political institutions. To be done well, scholars must possess a deep understanding of the institutions and people they study, or develop and possess connections with public officials and outside organizations (see Chapter 11 in this volume and Levine 2020). Groups are less willing to partner with scholars who do not share their interests in the real world of policy and politics; or in


contexts in which there is no trust between scholar and organization. Outside groups often want to harness the power of the gold standard of experimentation in order to evaluate public policy, evaluate their employees, or evaluate the effectiveness of their tactics with elected officials. Scholars must understand the incentives of elected and public officials and/or outside groups in order to successfully partner with them. When scholarly incentives for knowledge generation and practitioner incentives for learning what works in their organization align, field experiments to understand and learn more about political institutions and political elites are much easier to implement. By partnering with outside organizations or practitioners, scholars can address questions – particularly using field experiments – that would not otherwise be possible in randomized interventions conducted only by scholars. For example, scholars would not want to conduct a lobbying experiment on elected officials without partnering with an external organization, for ethical and legal reasons. However, an interest group that conducts a lobbying campaign could embed a randomized experiment in the advocacy work in which it regularly engages. Scholars seeking to conduct such experiments should work closely with political elites and real-world political groups to form relationships and seek feedback before conducting a field or survey experiment. This qualitative knowledge and feedback about what political elites hope to learn from experiments can help inform theoretically driven experiments in political science that are feasible (see Chapter 20 in this volume on multi-method approaches). Examples abound of such experiments with elites. While the scholarship utilizing experiments often does not detail the ways in which the experiment came about, it would behoove scholars to explain more about why the outside group agreed to partner to conduct the experiment. For instance, one experiment by Zelizer (2019) involved a partnership with legislators in order to assess cue-taking by legislators and whether this cue-taking was endogenous.


To conduct this experiment, the scholar partnered with US state legislators in one state. The legislators, presumably, were interested in learning what techniques worked to get their bills passed. Zelizer (2019) embedded a cue-taking experiment within a legislature and found that legislators take cues from their colleagues in their self-selected, endogenous networks. This work provides experimental evidence similar to Kingdon’s (1973) classic descriptive argument (see also Zelizer 2018). In my own experience, one can conduct scholarly led randomized interventions with public officials as subjects and also field experimental partnerships with public officials and organizations. With the exception of audit studies, where constituents are randomly assigned to contact legislators or other elected officials, I have found partnership experiments to be the most challenging yet also the most externally valid and novel. I would recommend this latter approach to most scholars hoping to study political institutions, especially with field experimentation. In one instance, a lobbyist and interest group randomly assigned the types of visits with legislative staff in order to assess the lobbyist’s effectiveness (Grose et al. forthcoming). This aligned with the outside actors’ incentives, and also allowed us to test a theoretical question about direct lobbying and its effect on legislator position-taking. The outside lobbying firm and interest group were interested in learning more about their tactics, and were also interested in assisting with scholarship. In the article, we argue and find that social lobbying, defined as meetings between lobbyists and public officials that take place in social spaces such as restaurants and coffee shops, is more likely to yield legislative support for interest group-preferred policy. Lobbyist meetings randomly assigned to be held in a social space with legislators yielded more legislative support than meetings held in a legislative office or a control group in which no direct lobbying occurred. In another instance, I have partnered with an institute and a former elected legislator to examine the policy diffusion of some state


policies to elected officials in other states (Grose 2019). Because I partnered with a former elected official, the response rate of elected officials in the survey was likely higher than it would have been had I been the only person associated with the survey, and the former legislator’s participation made the treatment embedded in the survey more realistic, as it came from a former legislator as opposed to just a scholar of political science. In the study, a survey was disseminated to state legislators across the country who are members of an environmental legislative caucus. This caucus includes legislators who care about environmental policy and who have chosen to participate. The survey randomly assigned some members of this cross-state environmental policy caucus into a treatment group to receive access to a web portal of summaries and the texts of policies passed in a number of states. The control group did not receive these bills from other states. Then, all state legislators were asked to evaluate and answer questions about their new ideas for state policies on the environment. Results showed that, relative to the control group, legislators exposed to new policy ideas were more likely to express interest in adopting those policies and to point to those states as sources for the legislation they develop. In addition, the former elected official’s participation made it possible to include behavioral outcomes for the legislators in the survey, and we were able to test whether legislators sought help from this former legislator in drafting legislation they learned about and that diffused from other states.

party delivered these messages in randomly assigned precincts and not in others. Here, the political elites were party delegates, and the treatment was also delivered by the party delegates. The outcome was the decision to run for higher-level party positions. The authors find that party leaders’ efforts to increase supply by having more women run and efforts to increase demand for more women officeholders increase the choice of women to run for and hold these elected party positions. To pull off such a partnership, scholars must be willing to branch out into the real world of politics. In some instances, a field experimental idea must be pitched to the outside organization or public officials. Scholars and outside groups benefit from forming agreements and memos of understanding prior to the launching of the experiment, in addition to the usual preregistration of the study and institutional review board submissions. Scholars also need to be willing to pivot from their initial ideas and move toward research ideas that emerge in discussions with practitioners and outside organizations. This is not quite a methodological challenge, but is a theoryto-empirics challenge. Theory should drive the research questions in our work, and thus a well-developed field experiment that connects directly to its theoretical predictions is the ideal. In practice, though, scholars may need to approach several outside groups before there is one willing to conduct the theoretically relevant experiment as a partner. Scholars should not conduct experiments that are theoretically irrelevant or that do not comport with their theoretical predictions. However, scholars must have some flexibility when partnering with outside groups to conduct experiments. Many outside organizations and practitioners are enthusiastic to use experimentation to learn what is effective, and these experiments are successful for all parties when policy evaluations of interest to the practitioners align with the ability to test the theoretical implications of interest to social scientists. In addition, another limitation to all experiments, but perhaps particularly an issue


with institutions, is that there are strategic and repeated interactions between public officials. Thus, a one-shot experiment may not allow for tests of theories based on repeated interactions with public officials or with officials and outside interest groups. Legislators engage in repeated interactions with one another in voting; lobbyists and interest groups engage in years of developing relationships with elected officials; and bureaucrats engage with one another and other public officials in repeated interactions. Thus, even when scholars are able to partner with organizations, there are constraints on what can be studied. Because people within political institutions communicate a lot with one another, there could be a concern about violations of the stable unit treatment value assumption (SUTVA) in these types of studies of institutions (if those in the treatment group communicate the treatment to those in the control group; see Zelizer 2019 and Chapter 16 in this volume on SUTVA). This will not be a problem in every study of institutions, but scholars should consider ways to assess whether communications between subjects could lead to SUTVA violations. 8.3.2 Sample Size, Statistical Power, and Experiments with Political Elites in Institutions One common political institution in which experiments have been used to study political elites is the legislature. Here, the subjects are often legislators themselves, or the staff of legislators. Most national legislatures across the world range from about 50 to 700 members. In Germany, the parliament’s lower house has 709 members and the upper house has 69 members. In Japan, the legislature has 465 members in the lower house and 242 members in the upper house, while in Argentina, the lower house has only 257 members and the upper house has 72 members. In the USA, scholars have conducted field experiments in small-to-medium-sized legislatures, such as on the US Senate (Grose et al. 2015) and with US House members


as subjects (Kalla and Broockman 2015). In the US states, the largest state legislature is New Hampshire’s lower house with 400 members, and this chamber has been subject to multiple field experiments as a result (e.g., Bergan 2009). Most US state legislatures total about 100–200 legislators, and sometimes fewer than 100 legislators. Other US state legislatures with fewer legislators have also been settings for field experiments of lobbying (Grose et al. 2019) and information (Butler and Nickerson 2011; Zelizer 2018). In studying any legislature, sample size is obviously an issue. Statistical power is thus a concern, and it is an even greater concern in studies involving multiple treatment conditions. Even though there is a move in the behavior literature toward massively scaled GOTV field experiments with very large samples, this is simply not feasible in almost all work where legislators or other actors embedded within political institutions are subjects. The statistical evidence offered in research on political institutions will more often than not have larger standard errors than those that are found in studies based on large-scale voter mobilization field experiments due to sample size. Another strategy is to increase sample size by studying political institutions such as bureaucracies that have larger potential subject pools (e.g., Andersen and Moynihan 2016). Because of low statistical power, scholars should focus on the magnitude of the effect and also take care in interpreting the statistical test (Gill 1999). Scholars must recognize this limitation of statistical power, but should nevertheless continue to study political institutions given the scholarly import of the research questions. In designing experiments with political elites, the number of randomized conditions often needs to be lower given the small sample sizes. Given small sample sizes, scholars should seek meaningful, fairly infrequent, and bold interventions. Scholars of political institutions cannot afford to engage in small interventions or optimizations that may be allowable in the literature on mass behavior where there are nearly unlimited subject pools in vast numbers.
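
To see what these sample sizes imply, a rough power calculation helps. The sketch below is not from the chapter; it uses the standard two-sample normal approximation for the minimum detectable effect and assumes equal-sized treatment and control groups, a two-sided test at the 0.05 level, and 80% power.

```python
# Back-of-the-envelope minimum detectable effect (MDE), in standard-deviation
# units, for a two-arm experiment, using the normal approximation:
# MDE = (z_{1-alpha/2} + z_{power}) * sqrt(1/n_treat + 1/n_control).
from scipy.stats import norm

def minimum_detectable_effect(n_treat, n_control, alpha=0.05, power=0.80):
    """Smallest standardized effect detectable with the stated power."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for a two-sided test
    z_power = norm.ppf(power)
    return (z_alpha + z_power) * (1 / n_treat + 1 / n_control) ** 0.5

# Hypothetical chamber sizes drawn from the examples in the text.
for label, n in [("400-member chamber", 400), ("100-member chamber", 100)]:
    half = n // 2
    print(f"{label}: MDE = {minimum_detectable_effect(half, half):.2f} SD")
```

With 400 subjects split evenly across two arms, effects smaller than roughly 0.28 standard deviations are unlikely to be detected; with 100 subjects, the threshold is roughly 0.56. Adding treatment arms shrinks each group further, which is one reason to favor fewer, bolder interventions.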


When studying political institutions, scholars should not waste a field experimental or survey experimental intervention answering a relatively minor theoretical or substantive question. When interventions are important and impactful, the concerns regarding small sample sizes may be lessened. Scholars must also carefully consider the impacts of their interventions and take great care in their design, given these limited subject pools and small sample sizes. To create meaningful interventions, as mentioned above, working with outside groups in the nonprofit or political realms can assist the experimenter. Not only is the experiment more externally valid, but these groups that scholars partner with can also generate the interventions. Realistic interventions are often meaningful interventions. When interventions are meaningful, the experiment is worth doing, even with a small sample size. 8.3.3 Replication versus Additional Evidence in the Experimental Study of Political Elites Replication is critically important in the study of political behavior and elites, and scholars studying political institutions – especially given small sample sizes – should seek to engage in replications (see Chapter 19 in this volume). The ideal replication strategy is to conduct multiple experiments on different subject populations. This may be an initial field experiment conducted in partnership with an organization and a replication conducted by the scholar. However, it will not always be possible to conduct an exact replication across multiple political institutions. Scholars who have partnered with an organization, for instance, to conduct a field experiment may be unable to persuade the same organization to conduct a second replication experiment in a different context. Advocacy groups may be based in one legislature and do not always have the appetite or resources to conduct replications in another legislature. In some instances, conducting multiple experiments in the same legislature with the same treatment and control

conditions will not be feasible because the legislators will learn over time and become aware that they are being experimented upon in the replications. For experimental studies of political institutions such as legislatures, when precise replications are not feasible, scholars should bring additional and multiple forms of evidence to bear on the theoretical question. In contrast to scholars of political institutions, experimentalists who focus on regular people as subjects can simply draw another survey sample to conduct a replication with identical survey questions asked across multiple experiments. Another GOTV study replication can be conducted as the number of voter subjects will not be depleted. However, in small legislatures, political institutions, and organizations, it may not be feasible to conduct multiple replications. Scholars of political institutions may need to resort to a field experiment supplemented with survey experimental evidence that addresses the same research question – even if the survey experiment is not a precise replication of the field experiment. Alternatively, a field experiment embedded in an institution could be paired with a replication in the laboratory where lab subjects are in institutional arrangements that mirror or reflect the interventions in the field institutional setting. In addition, observational or nonexperimental survey data that parallel outcomes from a field or survey experiment provide additional descriptive evidence when the scholar is unable to conduct an experimental replication, or a survey experiment with political elites could be paired with a similar survey experiment of regular citizens.
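
When a precise replication is out of reach and a field experiment is instead paired with a survey or laboratory study of the same question, one simple way to summarize the combined evidence is an inverse-variance-weighted average of the estimates, that is, a fixed-effect meta-analysis. This is not a procedure the chapter prescribes; the sketch below is one option among several, and it assumes the designs target the same estimand on a common outcome scale.

```python
# Minimal fixed-effect (inverse-variance) pooling of two estimates.
# The numbers are hypothetical placeholders, not results from any cited study.
import math

def pool(estimates, std_errors):
    """Inverse-variance-weighted average of estimates and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

field_estimate, field_se = 4.0, 2.5      # small legislature field experiment
survey_estimate, survey_se = 2.0, 0.8    # larger survey experiment

estimate, se = pool([field_estimate, survey_estimate], [field_se, survey_se])
print(f"Pooled estimate: {estimate:.2f} (SE {se:.2f})")
```

Because the weights are the inverse of the squared standard errors, the pooled figure is dominated by the more precise design; reporting the component estimates alongside any pooled summary keeps that asymmetry visible and makes clear that the designs are complementary evidence rather than exact replications.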

8.4 Experiments with Political Institutions: Be Bold, but Be Careful and Ethical Moving the discipline forward requires that scholars consider ways to incorporate randomized experimentation into our study of political institutions. Yet there still seems


to be reticence among some scholars of institutions who tend to prefer not to conduct experiments. Criticisms of experimental interventions from institutions scholars exist, and they are often whispered at conferences and in professional associations. Typically, one concern is that experimental interventions burden the time of political elites. I agree that scholars of institutions must carefully consider the ethics of experiments and interventions (whether these interventions are randomized or not; see Grose 2015). However, scholars should conduct bold and meaningful experiments in political institutions, as most experimental interventions with political elites have been relatively minimal when compared to scholars using nonexperimental methodological approaches. For instance, most extant research using field experiments to study political institutions and elites engages in very light and minimal interventions, contrary to the impression among some institutions scholars. The audit studies of legislators and other elected officials are typically short emails to which the officials can choose to respond in a matter of seconds. Contrast these minimal interventions to interventions of other political institutions scholars with observational approaches, which are typically much bolder in terms of the time and other commitments required from public officials. Scholars have used Freedom of Information Act (FOIA) laws to request information and data, even if the FOIA requests have been particularly time-intensive or burdensome. Classic and important research conducted long surveys of the membership of the US Congress (Miller and Stokes 1963) or allowed for participant observation of US House members for hours or days at a time (Fenno 1978). This nonexperimental research placed enormous time burdens on public officials – and also was some of the most consequential research on political institutions for generations. In fact, just as critics of experiments with institutions today lament the time burdens to public officials, so did scholars of an earlier era lament the time burdens of participant


observation, interviews, and surveys. Dexter (1964), writing in Public Opinion Quarterly, went so far as to say there were too many PhD students nonexperimentally intervening with federal public officials in the 1950s and 1960s. Dexter lamented that these PhD students – such as a young Dick Fenno – were wasting the time and “good will of important people” by conducting interviews and surveys of federal elected officials. These criticisms of observational interventions decades ago mirror some of those criticisms of experimental interventions with political elites heard by some contemporary scholars of institutions. However, in contrast, most experimental interventions with political elites conducted solely by scholars have been fairly minimal in terms of time burdens to the elected officials relative to much of the classic work in the study of political institutions and representation. When one is an expert who studies institutions, there will be substantial time commitments and interventions with political elites. As a result, it is extremely important for experimentalists to value and respect their ethical obligations when interacting with elected officials so as to allow for continued experimental and nonexperimental interventions with these political elites by future scholars. But given that most of the current experimental scholarship on political elites has focused mostly on one-time, small email interventions with elected officials via audits, I would recommend that experimental scholars of institutions think bigger, but think ethically and carefully. In future research, bolder interventions that allow for electoral or political institutions to be randomly varied could be a goal of scholars, and I anticipate other clever and impactful randomized experiments with political elites. There has always been an appetite for political institutions to conduct randomization within their own institutions, and thus scholars can also work with these institutions to conduct significant and new research. Just as Miller and Stokes (1963) and Fenno (1978) boldly intervened with legislators nonrandomly in an earlier scholarly generation, the


current vanguard of scholarship can engage in meaningful experimental interventions with public officials within political institutions. The research must be carefully designed so as to be ethical and respectful to political elites. Given the possibilities of pool depletion and small samples, the theoretical merit of the studies must be very high for the scholar to engage in randomized experiments with political elites. Scholars should not engage in purely scholarly led interventions that have the potential to put any elected official at higher-than-minimal risk. For the boldest interventions, scholars should have outside groups conduct interventions. An advocacy organization who planned on lobbying elected officials nonrandomly – a significant and bold intervention – will do so whether the lobbying is randomized or not. Thus, having the group randomize the intervention of lobbying is new and impactful scholarship, but it is also ethically acceptable, as elected officials were already receiving the interventions from the outside group. This type of advocacy experiment would not be a good idea for a solely scholarly led intervention, but would be ideal for an outside group to partner with a scholar to conduct. It is essential that scholars carefully consider and review the ethics of experiments with political elites, but also to think creatively on how new experiments can be conducted in the field, the lab, and with surveys to improve our understanding of political institutions. By intervening with and observing directly our elected officials and political institutions, we learn. To summarize, there are significant challenges – but also exciting and theoretically important opportunities for scholars interested in experiments in the study of political institutions. The frontier of experimental interventions with political elites and with political institutions should be bold and ethical. The best strategy for conducting such bold and ethical experimental research may be to partner with external groups who conduct interventions that are randomized.

8.5 Conclusion The rise of experiments in political science in the 2000s is clear from a perusal of the discipline’s top journals (Robison et al. 2018). The study of political institutions and political elites has also seen a rise in experimental methods over time, but the number of experiments – especially field experiments – is relatively small in the study of institutions. Good, theoretically driven work that empirically leverages field, survey, or laboratory experiments is critical to enhance our understanding of political institutions. The method should not drive the question, but the theory and the question should drive the empirical analysis. We have had decades of important questions about political polarization, parties in legislatures, and many other topics answered with observational data, and important work in this vein will continue. There are also now many unanswered questions that can be addressed with field experimental and other experimental methods, and existing questions that have not been examined with experimental or causal methods. Even more scholarship on political institutions should pair experimental methods with observational methods. There is much scholarship that remains to be done utilizing experimental methods to help us understand the behavior of political elites and political institutions. The gold standard of experiments to assess causality is an emergent method in the study of political elites and institutions.

References Andersen, Simon Calmar, and Donald P. Moynihan. 2016. “Bureaucratic Investments in Expertise: Evidence from a Randomized Controlled Field Trial.” Journal of Politics 78: 1032–1044. Anderson, Sarah E., Daniel M. Butler, and Laurel Harbridge. 2016. “Legislative Institutions as a Source of Party Leaders’ Influence.” Legislative Studies Quarterly 41: 605–631. Avellaneda, Claudia. 2013. “Mayoral Decisionmaking: Issue Salience, Decision Context, and

Choice Constraint? An Experimental Study with 120 Latin American Mayors.” Journal of Public Administration Research and Theory 23: 631–661. Bergan, Daniel E. 2009. “Does Grassroots Lobbying Work? A Field Experiment Measuring the Effects of an E-mail Lobbying Campaign on Legislative Behavior.” American Politics Research 37: 327–352. Broockman, David E. 2014. “Distorted Communication, Unequal Representation: Constituents Communicate Less to Representatives Not of Their Race.” American Journal of Political Science 58: 307–321. Broockman, David E., and Daniel M. Butler. 2015. “Do Better Committee Assignments Meaningfully Benefit Legislators? Evidence from a Randomized Experiment in the Arkansas State Legislature.” Journal of Experimental Political Science 2: 152–163. Bussell, Jennifer. 2019. “When are Legislators Partisan? Targeted Distribution and Constituency Service in India.” Manuscript, University of California, Berkeley. URL: http://egap.org/sites/default/files/Bussell_%20EGAP20.pdf Butler, Daniel M. 2014. Representing the Advantaged: How Politicians Reinforce Inequality. New York: Cambridge University Press. Butler, Daniel M. 2019. “Facilitating Field Experiments at the Subnational Level.” Journal of Politics 81: 371–376. Butler, Daniel M., and David Broockman. 2011. “Do Politicians Racially Discriminate against Constituents? A Field Experiment on State Legislators.” American Journal of Political Science 55: 463–477. Butler, Daniel M., and David W. Nickerson. 2011. “Can Learning Constituency Opinion Affect How Legislators Vote? Results from a Field Experiment.” Quarterly Journal of Political Science 6: 55–83. Chong, Dennis, and James N. Druckman. 2007. “A Theory of Framing and Opinion Formation in Competitive Elite Environments.” Journal of Communication 57: 99–118. Cirone, Alexandra, and Brenda Van Coppenolle. 2019. “Bridging the Gap: Lottery-based Procedures in Early Parliamentarization.” World Politics 71: 197–235. Coppock, Alexander. 2019. “Avoiding Posttreatment Bias in Audit Experiments.” Journal of Experimental Political Science 6: 1–4. Costa, Mia. 2017. “How Responsive are Political Elites? A Meta-analysis of Experiments on Public Officials.” Journal of Experimental Political Science 4: 241–254.


Cover, Albert D., and Bruce S. Brumberg. 1982. “Baby Books and Ballots: The Impact of Congressional Mail on Public Opinion.” American Political Science Review 76: 347–359. Dahl, Malte. 2019. “Detecting Discrimination: How Group-based Biases Shape Economic and Political Interactions.” PhD dissertation, University of Copenhagen. Darmofal, David, Charles J. Finocchiaro, and Indridi H. Indridason. 2019. “Roll-call Voting Under Random Seating Assignment.” Working paper. De Rooij, Eline A., Donald P. Green, and Alan S. Gerber. 2009. “Field Experiments on Political Behavior and Collective Action.” Annual Review of Political Science 12: 389–395. Dexter, Lewis A. 1964. “The Good Will of Important People: More on the Jeopardy of the Interview.” Public Opinion Quarterly 28: 558–563. Druckman, James N., and Julia Valdes. 2019. “How Private Politics Alters Legislative Responsiveness.” Quarterly Journal of Political Science 14: 115–130. Eber, Lauren. 2005. “Waiting for Watergate: The Long Road to FEC Reform.” Southern California Law Review 79: 1155–1202. Eckel, Catherine C., Enrique Fatas, and Richard K. Wilson. 2010. “Cooperation and Status in Organizations.” Journal of Public Economic Theory 12: 737–762. Eldersveld, Samuel J. 1956. “Experimental Propaganda Techniques and Voting Behavior.” American Political Science Review 50: 154–165. Enemark, Daniel, Mathew D. McCubbins, and Nicholas Weller. 2014. “Knowledge and Networks: An Experimental Test of How Network Knowledge Affects Coordination.” Social Networks 36: 122–133. Erikson, Robert S., and Laura Stoker. 2011 “Caught in the Draft: The Effects of Vietnam Draft Lottery Status on Political Attitudes.” American Political Science Review 105: 221–237. Fenno, Richard F. 1978. Home Style: House Members in Their Districts. Boston, MA: Little, Brown. Fiorina, Morris P., and Charles R. Plott. 1978. “Committee Decisions Under Majority Rule: An Experimental Study.” American Political Science Review 72: 575–598. Gell-Redman, Micah, Neil Visalvanich, Charles Crabtree, and Christopher J. Fariss. 2018. “It’s All About Race: How State Legislators Respond to Immigrant Constituents.” Political Research Quarterly 71: 517–531. Gerber, Alan, and Donald P. Green. 2000. “The Effects of Canvassing, Telephone Calls, and


Direct Mail on Voter Turnout: A Field Experiment.” American Political Science Review 94: 653–663. Gerber, Alan and Donald P. Green. 2014. “Field Experiments on Voter Mobilization: An Overview of a Burgeoning Literature.” In Handbook of Economic Field Experiments. Amsterdam: Elsevier, pp. 395–438. Gill, Jeff. 1999. “The Insignificance of Null Hypothesis Significance Testing.” Political Research Quarterly 52: 647–674. Gosnell, Harold. 1927. Getting-Out-the-Vote: An Experiment in the Stimulation of Voting. Chicago, IL: University of Chicago Press. Grose, Christian R. 2014. “Field Experimental Work on Political Institutions.” Annual Review of Political Science 17: 355–370. Grose, Christian R. 2015. “Field Experiments on Elected and Public Officials: Ethical Obligations and Requirements.” In Ethics and Experiments, ed. Scott Desposato. London: Routledge, pp. 227–238. Grose, Christian R. 2019. “Policy Diffusion Occurs Legislator to Legislator across State Lines: A Survey Experiment of the National Conference of Environmental State Legislators.” Working paper. Grose, Christian R., and Abby K. Wood. 2020. “Randomized Experiments by Government Institutions and American Political Development.” Public Choice. DOI: 10.1007/s11127019-00704-5. Grose, Christian R., and Jordan Carr Peterson. 2020. “Economic Interests Cause Elected Officials to Liberalize Their Racial Attitudes.” Political Research Quarterly 73: 511–525. Grose, Christian R., Neil Malhotra, and Robert P. Van Houweling. 2015. “Explaining Explanations: How Legislators Explain Their Positions and How Citizens React.” American Journal of Political Science 59: 724–743. Grose, Christian R., Pamela Lopez, Sara Sadhwani, and Antoine Yoshinaka. Forthcoming. “Social Lobbying.” Journal of Politics. Hall, Andrew B., Connor Huff, and Shiro Kuriwaki. 2019. “Wealth, Slaveownership, and Fighting for the Confederacy: An Empirical Study of the American Civil War.” American Political Science Review 113: 658–673. Hall, Matthew E. K. 2009. “Experimental Justice: Random Judicial Assignment and the Partisan Process of Supreme Court Review.” American Politics Research 37: 195–226. Hall, Matthew E. K. 2010. “Randomness Reconsidered: Modeling Random Judicial

Assignment in the U.S. Courts of Appeals.” Journal of Empirical Legal Studies 7: 574–589. Iyengar, Shanto, and Donald R. Kinder. 1987. News That Matters: Television and American Opinion. Chicago, IL: University of Chicago Press. Jo, Donghee, and Matt Lowe. 2019. “The Limits of Political Integration: A Natural Experiment in Iceland.” Working paper. Kalla, Joshua L., and David E. Broockman. 2016. “Campaign Contributions Facilitate Access to Congressional Officials: A Randomized Field Experiment.” American Journal of Political Science 60: 545–558. Kalla, Joshua L., Frances Rosenbluth, and Dawn Langan Teele. 2018. “Are You My Mentor? A Field Experiment on Gender, Ethnicity, and Political Self-starters.” Journal of Politics 80: 337–341. Kanthak, Kristin, and Jonathan Woon. 2015. “Women Don’t Run? Election Aversion and Candidate Entry.” American Journal of Political Science 59: 595–612. Karpowitz, Christopher F., J. Quin Monson, and Jessica R. Preece. 2017. “How to Elect More Women: Gender and Candidate Success in a Field Experiment.” American Journal of Political Science 61: 927–943. Kastellec, Jonathan P. 2011. “Panel Composition and Voting on the U.S. Courts of Appeals over Time.” Political Research Quarterly 64: 377–391. Kingdon, John W. 1973. Congressmen’s Voting Decisions. New York: Harper and Row. Kriner, Douglas L., and Eric Schickler. 2014. “Investigating the President: Committee Probes and Presidential Approval, 1953– 2006.” Journal of Politics 76: 521–534. Kriner, Douglas L., and Eric Schickler. 2017. Investigating the President: Congressional Checks on Presidential Power. Princeton, NJ: Princeton University Press. Lagunes, Paul, and Oscar Pocasangre. 2019. “Dynamic Transparency: An Audit of Mexico’s Freedom of Information Act.” Public Administration 97: 162–176. Lajevardi, Nazita. 2018. “Access Denied: Exploring Muslim American Representation and Exclusion by State Legislators.” Politics, Groups, and Identities. DOI: 10.1080/21565503.2018. 1528161. Laponce, J. A. 1966. “An Experimental Method to Measure the Tendency to Equibalance in a Political System.” American Political Science Review 60: 982–993. Larimer, Christopher. 2018. “Voter Turnout Field Experiments.” Oxford Bibliographies.

URL: www.oxfordbibliographies.com/view/document/obo-9780199756223/obo-9780199756223-0243.xml Levine, Adam Seth. 2020. “Research Impact Through Matchmaking (RITM): Why and How to Connect Researchers and Practitioners.” PS: Political Science and Politics. DOI: 10.1017/S1049096519001720. Levy, Marin K. 2017. “Panel Assignment in the Federal Courts of Appeals.” Cornell Law Review 103: 65–116. Lowande, Kenneth, and Andrew Proctor. 2019. “Bureaucratic Responsiveness to LGBT Americans.” American Journal of Political Science. DOI: 10.1111/ajps.12493. Mayhew, David R. 1974. Congress: The Electoral Connection. New Haven, CT: Yale University Press. Mendez, Matthew S. 2015. “In/visible Constituents: The Representation of Undocumented Immigrants.” PhD dissertation, University of Southern California. URL: http://digitallibrary.usc.edu/cdm/ref/collection/p15799coll3/id/622883 Mendez, Matthew S., and Christian R. Grose. 2018. “Doubling Down: Inequality in Responsiveness and the Policy Preferences of Elected Officials.” Legislative Studies Quarterly 43: 457–491. Miller, Warren A., and Donald E. Stokes. 1963. “Constituency Influence in Congress.” American Political Science Review 57: 45–56. Morton, Rebecca B., and Kenneth C. Williams. 2010. Experimental Political Science and the Study of Causality. New York: Cambridge University Press. Naurin, Elin, and Patrik Ohberg. 2019. “Ethics in Elite Experiments: A Perspective of Officials and Voters.” British Journal of Political Science. DOI: 10.1017/S0007123418000583. Nielsen, Poul A., and Donald P. Moynihan. 2017. “Romanticizing Bureaucratic Leadership? The Politics of How Elected Officials Attribute Responsibility for Performance.” Governance 20: 541–559. Palfrey, Thomas R. 2009. “Laboratory Experiments in Political Science.” Annual Review of Political Science 12: 379–388. Porter, Ethan, and Jon C. Rogowski. 2018. “Partisanship, Bureaucratic Responsiveness, and Election Administration: Evidence from a Field Experiment.” Journal of Public Administration Research and Theory 28: 602–617. Riker, William H. 1967. “Bargaining in Three Person Games.” American Political Science Review 61: 342–356.


Riker, William H., and William J. Zavoina. 1970. “Rational Behavior in Politics: Evidence from a Three Person Game.” American Political Science Review 64: 48–60. Robison, Joshua, Randy T. Stevenson, James N. Druckman, Simon Jackman, Jonathan N. Katz, and Lynn Vavreck. 2018. “An Audit of Political Behavior Research.” SAGE Open. DOI: 10.1177/2158244018794769. Rogowski, Jon C., and Betsy Sinclair. 2017. “Estimating the Causal Effects of Social Interactions with Endogenous Networks.” Political Analysis 20: 316–328. Sheffer, Lior, Peter John Loewen, Stuart Soroka, and Stefaan Walgrave. 2018. “Nonrepresentative Representatives: An Experimental Study of the Decision Making of Elected Politicians.” American Political Science Review 112: 302–321. Sunstein, Cass R., David Schkade, Lisa M. Ellman, and Andres Sawicki. 2006. Are Judges Political? An Empirical Analysis of the Federal Judiciary. Washington, DC: Brookings Institution Press. Thomsen, Danielle M., and Bailey K. Sanders. 2019. “Gender Differences in Legislator Responsiveness.” Perspectives on Politics. DOI: 10.1017/S1537592719003414. Titiunik, Rocio. 2016. “Drawing Your Senator from a Jar: Term Length and Legislative Behavior.” Political Science Research and Methods 4: 293–316. Titiunik, Rocio, and Andrew Feher. 2018. “Legislative Behaviour Absent Re-election Incentives: Findings from a Natural Experiment in the Arkansas Senate.” Journal of the Royal Statistical Society 181: 351–378. Vaishnav, Milan, Saksham Khosla, Aidan Milliff, and Rachel Osnos. 2019. “Digital India? An Email Experiment with Indian Legislators.” India Review 18: 243–263. Weiman, David F. 1991. “Peopling the Land by Lottery? The Market in Public Lands and the Regional Differentiation of Territory on the Georgia Frontier.” Journal of Economic History 51: 835–860. White, Ariel R., Noah L. Nathan, and Julie K. Faller. 2015. “What Do I Need to Vote? Bureaucratic Discretion and Discrimination by Local Election Officials.” American Political Science Review 109: 129–142. Williams, Brian D., and Indridi Indridason. 2018. “Luck of the Draw? Private Members’ Bills and the Electoral Connection.” Political Science Research and Methods 6: 211–227. Wood, Abby K., and Christian R. Grose. 2019. “Random Audits and Regulatory Compliance.”


The Regulatory Review, November URL: www.theregreview.org/2019/11/21/woodgrose-random-audits-regulatory-compliance/. Wood, Abby K., and Christian R. Grose. Forthcoming. “Campaign Finance Transparency Affects Legislators’ Election Outcomes and Behaviors.” American Journal of Political Science. Wood, Abby K., and David E. Lewis. 2017. “Agency Performance Challenges and Agency

Politicization.” Journal of Public Administration Research and Theory 27: 581–595. Zelizer, Adam. 2018. “How Responsive Are Legislators to Policy Information? Evidence from a Field Experiment in a State Legislature.” Legislative Studies Quarterly 43: 595–618. Zelizer, Adam. 2019. “Is Position-taking Contagious? Evidence of Cue-taking from Two Field Experiments in a State Legislature.” American Political Science Review 113: 340–352.

CHAPTER 9

Convenience Samples in Political Science Experiments∗

Yanna Krupnikov, H. Hannah Nam, and Hillary Style

∗ We are grateful to Diana Mutz, Christian Grose, Jacob Rothschild, James N. Druckman, and Donald P. Green for their very helpful comments on this chapter.

Abstract: This chapter provides an overview of convenience sampling in political science experimental research. We first define convenience sampling and track its use in political science research. Next, we focus on the use of three sample types: undergraduate convenience samples, crowdsourced convenience samples, and other types such as social media and nonstudent convenience samples. We review empirical research on the potential issues with each convenience sample type, along with best practices that can be used to address those issues. Overall, we conclude that while there are justified concerns that scholars should be aware of when using convenience samples, much of the empirical research suggests that they provide valid results for experimental treatment effects that reliably replicate across more representative probability samples.

As reliance on experiments in political science has grown, so too has the diversity of samples used in these experiments. While early experimental studies relied on undergraduate students (Iyengar 2011), new experimental modes now allow scholars to branch out to different types of samples (Franco et al. 2017). Sometimes this means relying on

survey companies to recruit probability samples for experimental participation (Mutz 2011); other times, however, the search for participants means turning to new ways of recruiting convenience samples. In this chapter, we explore the use of convenience samples in experimental political science. While our discussion focuses on survey and lab experiments, we also consider the role convenience samples play in field experiments (see Chapter 4 in this volume) and audit studies (see Chapter 3 in this volume). We note that much of the research we will



review in this chapter focuses on the extent to which convenience samples generalize to the American population. This is because, to date, most of the work on convenience samples has considered factors such as external validity and generalizability within the US context.

9.1 What Is a Convenience Sample?

In a probability sample, members of some target population have a known, nonzero probability of being drawn into the sample. More generally, participants who form a probability sample are selected because they form a representative sample of that target population. This is not the case in a convenience sample. First, in a convenience sample, participant selection is based on the ease of access, where access is defined broadly. Convenience sample participants are invited to take part in the study because they are physically near the experimental site, are especially willing to participate, or can be recruited at a low transaction cost. Some convenience samples may have all three of these characteristics, but others may reflect only one or two. This leads to the second feature of a convenience sample. Those in the target population who are “convenient” are significantly more likely to be recruited than other target population members. As a result, convenience samples have “no design-based justifications for generalizing from the sample to the population” (Coppock and McClellan 2019, p. 2). Following this definition, undergraduate students can easily be categorized as convenience samples, as they are invited to participate based largely on proximity and availability (Druckman et al. 2011; Maxwell and Delaney 2004). On the other end of the spectrum, samples that are obtained through survey companies that rely on probability sampling of a target population are not convenience samples (Mutz 2011). While these are the opposing ends of a continuum, however, there are other samples that political scientists often rely on that are neither undergraduates nor probability samples (Franco et al. 2017). These samples include crowdsourced data and participants

recruited by survey companies that do not use probability sampling. We categorize crowdsourced data – data that rely on volunteers who respond to an open call for experiment participants – as convenience samples. Since convenience samples are made up of people who have a high willingness to take part in research and are easily accessible by the researcher, members of crowdsourcing platforms meet both of these criteria. Although not geographically proximate like undergraduates, these people are easily reached via platforms that are deliberately designed for researchers to find willing study participants. These participants are not invited to take part in a study because they are a representative sample of the target population – they are invited because they happen to be easily accessible. If crowdsourced participants form convenience samples, how would one characterize samples from survey companies that are nonprobability, such as lower-cost survey companies that maintain panels of participants (e.g., Dynata, Qualtrics Panel)? We turn again to the definition of a convenience sample that rests on the ease of recruitment. Although these types of samples may be lower cost and are non-probability, a researcher must still go through a survey company to reach these participants. Moreover, the survey company often attempts some form of national representativeness by performing deliberate balancing on the types of participants who are invited to take part in a researcher’s study.1 For ease of discussion in this chapter, we will term these samples “balanced samples.”2 Since participants are deliberately invited to participate in a study in order to provide some particular sample outcome – rather than only because they are easily accessible – and since the researcher cannot access these participants directly, these types of samples do not

1 This makes it difficult to calculate a true response rate because different participants may be invited at different rates and in different ways throughout fielding depending on the remaining need. 2 We also use the term “balanced samples” following internal descriptions of their sampling procedures by these types of survey companies (Pieters 2018).


meet the definition of convenience samples.3 These balanced samples are not probability samples, but they are not convenience samples either.4
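The following minimal simulation makes both features of this definition concrete: inclusion driven by accessibility, and the absence of a design-based bridge back to the population. It is an illustration added here, not drawn from the studies cited in this chapter; the population, the "accessibility" score, and the age skew are invented for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000                                # stylized target population
age = rng.normal(47, 17, N).clip(18, 90)   # one population characteristic
# Stylized "accessibility": younger members are assumed easier to reach online
accessibility = 1 / (1 + np.exp((age - 35) / 10))

# Probability sample: every member has the same known inclusion probability
prob_idx = rng.choice(N, size=2_000, replace=False)

# Convenience sample: inclusion probability proportional to accessibility,
# so the "convenient" members are far more likely to be recruited
conv_idx = rng.choice(N, size=2_000, replace=False,
                      p=accessibility / accessibility.sum())

print(f"population mean age:      {age.mean():.1f}")
print(f"probability sample mean:  {age[prob_idx].mean():.1f}")
print(f"convenience sample mean:  {age[conv_idx].mean():.1f}")
```

In the convenience sample, the more "accessible" (here, younger) members of the population are overrepresented, and nothing in the design tells the researcher how to undo that overrepresentation.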

9.2 How Often Are Convenience Samples Used?

To contextualize the use of convenience samples in political science research, we review their use in the discipline’s three leading journals: American Political Science Review, American Journal of Political Science, and the Journal of Politics. We track the use of convenience samples (e.g., undergraduate, crowdsourced, local residents) versus other types of samples in experiments. In total, we distinguish between five types of samples: (1) undergraduate (convenience), (2) crowdsourced (convenience), (3) balanced (e.g., Dynata and other non-probability survey companies), (4) YouGov (weighted to probability), and (5) GfK (probability).5 We distinguish within sample categories to offer more fine-grained data on the types of samples present in the discipline’s top journals, as well as to follow previous research on the topic (e.g., Franco et al. 2017).6 We consider the articles from 2014 through 2018.

3 Another form of sample that has emerged is the pure volunteer sample, where people volunteer to be contacted for studies even if they do not receive any compensation for participation (e.g., Strange et al. 2018). In most cases, the researchers can access these participants directly, and since they are cost-free, they are likely convenient. Therefore, volunteer samples are convenience samples.
4 Examples of a grey area are Lucid (Coppock and McClellan 2017), which allows for quota sampling but can include participants because they are accessible rather than to meet some desired representation, as well as MTurk Prime, which also attempts to give the researcher some more control over a crowdsourced sample. Given the definitions in this chapter, Lucid seems closer to a balanced sample than a convenience sample.
5 In some of the articles in this sample, YouGov is described by its earlier title: Polimetrix.
6 Data obtained through Google Scholar searches for each journal from 2014 through 2018. GfK uses probability-based sampling methods to obtain representative samples; YouGov uses model-based techniques to sample to an approximate representative population.


Figure 9.1 shows the counts of experimental studies in articles by year and journal for each sample type. For articles that include multiple experiments, each experiment is counted once regardless of whether all of the experiments relied on the same sample or on different samples. The 139 unique articles included in Figure 9.1 had a total of 241 experiments. Over the five-year period, we see a general increase in the use of experiments. Notably, however, we also see the continued presence of convenience samples, with 22 convenience sample studies in 2014 and 24 convenience sample studies in 2018. These patterns are in line with Franco et al. (2017), who also demonstrate increased reliance on convenience samples in the form of crowdsourced data. Keeping these trends in mind, we consider the different types of convenience samples used in political science. Following the sample categories in Figure 9.1, we begin with undergraduate samples, then consider crowdsourced samples, and, finally, turn to other types of convenience samples that are used less often (e.g., local residents, campus staff). In all cases, we consider empirical research on the potential drawbacks of relying on this type of convenience sample, as well as best practices.
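For readers who want to update or extend this kind of audit, counts like those summarized in Figure 9.1 can be tabulated from a hand-coded dataset with one row per experiment. The sketch below is purely illustrative: the file name and the year, journal, and sample_type columns are hypothetical stand-ins, not the chapter's replication materials.

```python
import pandas as pd

# Hypothetical hand-coded file: one row per experiment, with the journal,
# publication year, and sample type assigned by the coder.
experiments = pd.read_csv("coded_experiments_2014_2018.csv")

counts = (experiments
          .groupby(["year", "journal", "sample_type"])
          .size()
          .rename("n_experiments")
          .reset_index())

# One article with three experiments contributes three rows, matching the
# counting rule used for Figure 9.1.
print(counts.pivot_table(index=["year", "journal"],
                         columns="sample_type",
                         values="n_experiments",
                         fill_value=0))
```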

9.3 Undergraduate Convenience Samples The undergraduate student sample was perhaps the first type of convenience sample to be widely used, as scholars capitalized on the readily available student populations of the institutions where they conducted their research. The use of undergraduate student samples first proliferated in social psychology research beginning in the 1960s, and they have been a consistent source for data since then across social science disciplines, including political science, economics, business, and psychology. One of the main benefits of undergraduate convenience samples is cost, since students are typically compensated for study participation with course credit rather than monetary



Figure 9.1 Use of various samples in political science journals. In the figure, the y-axis is the number of experiments that rely on a given sample. This means that a single article with three experiments produces three separate cases in this count. APSR = American Political Science Review; AJPS = American Journal of Political Science; JOP = Journal of Politics.

remuneration. Even when student subjects are compensated financially, the costs to the researcher tend to be quite low. For instance, payment can be in the form of a lottery for a gift certificate. Or, in the case of incentivized experiments in economics, the ultimate payment is dependent on the subject’s own performance, so the expected total subject costs are at the average rather than the maximum rate of possible payout. Still, participation for course credit instead of for money continues to be the norm for student samples, and thus undergraduates provide the cheapest and one of the most accessible convenience samples. In this section, we first consider the limitations and concerns that have been raised regarding the use of undergraduate samples. Then we consider suggestions for best practices that may help to maximize the utility of and researchers’ confidence in sampling undergraduate participants, based on empirical comparisons between student and nonstudent adult samples.

9.3.1 Concerns about Undergraduate Convenience Samples

Sears (1986) famously raises several potential concerns regarding the use of undergraduate student samples in research on social and political topics. The primary issue raised by Sears (1986) is the potential misrepresentation of relationships between variables due to meaningful differences between student and nonstudent adults – that is, whether results gleaned from college students would be generalizable to the broader population. In other words, “college sophomores may not be people,” as Hovland (1959) stated, quoting Tolman. (Indeed, this is a common critique of the use of convenience samples in general.)


For instance, an effect may be detected among undergraduates but not to any significant extent among more representative adult samples, and conversely, an effect may not be visible among undergraduates even though it would be detected among other adults. Here, we outline the primary issues of potential concern, bringing evidence to bear when available. We note an important caveat: although the Sears (1986) argument has been influential, there has been little evidence that college students are “not people.” A key exception is Henry (2008), which considers Sears’ arguments directly in the context of racial attitudes – a study we turn to later in this section. So why might researchers be suspicious that there are meaningful differences between college students and adults? One feature of this “narrow base” sample is undergraduates’ level of education and attendant abilities. Those who are admitted to college are understood to possess particularly skillful cognitive abilities, including the cognitive capacity and motivation to facilitate ideological reasoning in particular (Highton 2009). These cognitive features likely do differentiate college students from those who have not attained a college education. On the other hand, however, college students are likely quite similar to general-population adults with higher levels of education. Another obvious factor of the “narrow base” is that undergraduates typically comprise a narrow age range in late adolescence and early adulthood. Individuals in this period are especially impressionable when it comes to political attitudes (Alwin and Krosnick 1991; Osborne et al. 2011), so measurement of their political preferences may be quite unstable compared to a wider range of adults across the lifespan. Panel studies often demonstrate that older adults tend to have more stable political attitudes than young adults and late adolescents (Jennings and Markus 1984; Jennings and Niemi 1981). What is more, many people first develop an interest in politics in their late teens or early twenties (Russo and Stattin 2017), with political interest crystallizing and stabilizing in young adulthood (Prior


2010, 2019), and ideological preferences often crystallizing during the first electoral campaign in which individuals are eligible to vote (Sears and Valentino 1997). These features of development in young adulthood suggest that college students are politically unique, perhaps making it somewhat difficult to generalize from their political attitudes to the larger population. College students are commonly thought to be more ideologically uniform than the broader population of adults, perhaps as a function of the influence of their education (Newcomb 1943), as well as their vulnerability to social pressures (Sears 1986). Indeed, casual observations from scholars often note the supposed liberal skew of undergraduates, with the assumption that the largely uniformly liberal views of students will bias the results of the studies in which they participate. Yet in a recent and large analysis of undergraduate ideological development in 38 different colleges, Mariani and Hewitt (2008) point out that “student orientation when leaving college is not significantly different than the population at large” (p. 779). A final concern may be the university context itself. Focusing on measures of racial attitudes, Henry (2008) compares undergraduate students to a convenience sample of nonstudents living in a similar geographic area. Across a variety of measures, undergraduates had more positive racial attitudes and perceived diversity as more important than the nonstudent sample. Henry ascribes this pattern to the social context of the university where students are exposed to more diversity and “liberally oriented groups” (p. 59). What may differentiate the undergraduate sample, then, is that “the campus environment could be described as unique in America compared to other environments, such as executive boardrooms, corner pubs, athletic stadiums, or suburban kitchens” (Henry 2008, p. 59). Contextual factors related to changes in how studies are conducted may also affect the data collected from undergraduate subjects. Specifically, the increasing reliance on online data collection (vs. in-person lab-based data collection) to recruit student subjects appears to affect data quality itself. Clifford and Jerit



(2014) compare undergraduate respondents who participate in a lab study with those who participate online, finding that although the results for both study modes are similar, online respondents are more distracted and more likely to look up answers to political knowledge questions. Anecdotal observations from scholars also suggest the possibility that data quality from students can vary at different points in the semester (with more attentive and conscientious students participating earlier). Such anecdotal observations also note deterioration in overall attentiveness as students are increasingly able to take online studies on their smartphones – even while sitting in class. Thus, researchers using undergraduate samples may want to consider returning many of their studies to the lab, rather than relying solely on online data collection. 9.3.2 Best Practices for the Use of Undergraduate Convenience Samples The question of whether students vary meaningfully from nonstudent adult populations warrants empirical scrutiny. As Druckman and Kam (2011) note, “[I]f the underlying data generating process is characterized by a homogeneous treatment effect … then any convenience sample should produce an unbiased estimate of that single treatment effect, and, thus, the results from any convenience sample should generalize easily to any other group” (emphasis added). But if there is a heterogeneous treatment effect (i.e., a moderating or individual difference factor), then that has the potential to reduce or eliminate generalizability (see also Chapter 21 in this volume). Thus, Druckman and Kam (2011) examine survey data from both college and noncollege samples to compare the two sample types for variability on several dimensions that political scientists typically consider. Strikingly – and in contrast to Sears’ (1986) concerns – they find that, across most dimensions, students and nonstudent adults are not different, specifically in terms of partisanship, ideology, importance of religion, belief in limited government, views about homosexuality

and immigrants, social trust, extent of following and discussing politics, and general media use. The few dimensions on which students are different from nonstudent adults are religious attendance, level of political information, and some specific types of media use. By and large, students appear to be quite similar to nonstudent adults in most key variables pertinent to political scientists’ interests. Comparisons of students across different institutions with varying demographic characteristics show they also behave similarly to one another in the same experimental contexts (Lupton 2018). Moreover, other investigations that have scrutinized the claim that college students behave in markedly different ways in experiments compared to the general population have, for the most part, shown that the average treatment effects in student samples are similar to more representative samples (Mullinix et al. 2015; although see Henry 2008 for notable differences in racial attitudes). For instance, Yeager et al. (2019) demonstrate that across social psychology experiments tracking conformity, persuasion, base-rate utilization, law of large numbers, and conjunction fallacy, representative samples of American adults reliably replicate the results found with undergraduate samples. Although the effect sizes tend to be smaller in the representative samples than in the undergraduate samples, Yeager et al. (2019) observe that the replicated effects are particularly strong among adult participants who share demographic characteristics with the typical college student samples of the original studies. Thus, we would suggest that political scientists carefully consider whether the variables they are investigating are likely to be ones in which college students are meaningfully different from noncollege adults, and if not, they can confidently conduct their studies to make cautious generalizations. Finally, to the extent that undergraduates are distinct from nonstudent adult populations on certain dimensions, they may provide especially useful participants for studies in which political scientists are examining those very dimensions. For example, scholars


of political attitude development and crystallization can capitalize on college students’ relative attitudinal malleability. Or political scientists interested in examining specific levels of religious attendance or political knowledge could likely learn a great deal from narrowing their view to undergraduate samples. Although undergraduate students are unique in a few ways compared to nonstudent adults, they may not be as peculiar of a subject type as Sears (1986) feared. Rather, they appear to have more in common with the broader population than previously thought – and college sophomores may, in fact, be people.
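The Druckman and Kam (2011) logic quoted above, namely that convenience samples generalize easily when treatment effects are homogeneous but can mislead when effect heterogeneity is tied to who enters the sample, can be illustrated with a stylized simulation. The sketch below is an added illustration; the "student-like" selection rule and the age-based moderator are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
age = rng.normal(47, 17, N).clip(18, 90)
# Convenience pool skews heavily toward the young (a stylized "student-like" rule)
in_pool = rng.random(N) < np.where(age < 25, 0.50, 0.02)

def ate_estimates(effect):
    """Randomize treatment in the full population, then compare the population
    difference in means with the estimate computed only inside the pool."""
    treat = rng.random(N) < 0.5
    y = rng.normal(0, 1, N) + treat * effect
    pop = y[treat].mean() - y[~treat].mean()
    pool = y[treat & in_pool].mean() - y[~treat & in_pool].mean()
    return round(pop, 3), round(pool, 3)

# Homogeneous effect: the convenience estimate tracks the population ATE
print(ate_estimates(np.full(N, 0.5)))

# Effect that declines with age (a moderator correlated with pool membership):
# the convenience estimate drifts toward the effect among young respondents
print(ate_estimates(0.5 - 0.01 * (age - 47)))
```

With the constant effect, both estimates hover around 0.5; with the age-moderated effect, the convenience-sample estimate drifts toward the larger effect among the young respondents who dominate that pool.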

9.4 Crowdsourced Convenience Samples Over the last several years, scholars have increasingly turned to crowdsourced data (see Franco et al. 2017 and Figure 9.1 of this chapter). While the main source has been Amazon’s Mechanical Turk (MTurk), there are a number of other platforms for participant recruitment. Prolific Academic (ProA), for example, is designed directly for research, while CrowdFlower (CF) allows scholars to obtain more identifying information from participants (something that MTurk explicitly forbids) (Peer et al. 2017). The reliance on crowdsourcing platforms for convenience samples is not surprising. They provide a means of collecting data quickly and, often, at a significantly lower price point than survey companies. The emergence of crowdsourced convenience samples can allow scholars with fewer resources to run experiments on samples that are, potentially, more diverse than undergraduates. In this section, we consider the concerns about and emerging best practices for the use of crowdsourced convenience samples. In considering these approaches, we note that much of the empirical work on the robustness and generalizability of crowdsourced samples that has been completed to date compares the results obtained with


these samples to more representative samples of American participants. As a result, there is much less evidence that the results obtained with crowdsourced samples generalize beyond US populations. We consider this limitation further at the end of our discussion of crowdsourced data. 9.4.1 Concerns about Crowdsourced Convenience Samples Since the emergence of crowdsourced convenience samples, scholars have expressed concerns and critiques about their validity. While some of these concerns have appeared in peer-reviewed articles (e.g., Ford 2017; Harms and DeSimone 2015), many have come through informal feedback and conversation (i.e., comments in anonymous reviews and during presentations), as well as in academic blogs (Gelman 2013; Kahan 2013). We can sort some of the most common critiques of crowdsourced data into three categories: (1) sample demographics, (2) the non-naivete of subjects, and (3) response inconsistencies.

9.4.1.1 sample demographics Crowdsourced samples can be defined as convenience samples because they include participants who are both highly willing to take part in studies (often for a relatively small amount of compensation) and readily accessible to researchers. These characteristics can manifest in two ways. First, participants decide on their own to join a crowdsourcing website. Second, participants can select from lists of available tasks they can complete. The type of decision-making that leads people to a crowdsourcing website, however, can affect the demographic characteristics of the eventual recruited sample. Much of the research on the potential demographic peculiarities of crowdsourced samples has focused on the use of MTurk, as it is the most established and frequently used crowdsourcing platform. This research compares the samples obtained via MTurk to benchmarks – survey company-recruited general population samples. Most of these comparisons are of US-based MTurk samples



to a general population US sample obtained by a survey company. There is some evidence that crowdsourced samples could be similar to online samples recruited by survey companies. Huff and Tingley (2015), for example, find that Cooperative Congressional Election Study (CCES) respondents and participants recruited via MTurk have similar patterns in occupations and generally come from similar geographic locations in the USA. There is more research, however, highlighting demographic differences. In his critique of MTurk samples, Kahan (2013) writes that MTurk-recruited samples overrepresent women but underrepresent African Americans relative to the US population. MTurk convenience samples can also underrepresent conservatives relative to other national studies (Huff and Tingley 2015). MTurk samples, others suggest, may also skew younger and more educated than other national samples (Huff and Tingley 2015). They are also less likely to be married (Berinsky et al. 2012; Shapiro et al. 2013), report lower personal incomes, and, relative to the general US population, are more likely to be unemployed (Shapiro et al. 2013). They are also more likely to self-report as lesbian, gay, or bisexual (Corrigan et al. 2015). While there is less empirical research about the demographic characteristics of other crowdsourced platforms relative to benchmark samples, comparisons of the various platforms to each other (rather than to a benchmark probability sample) suggest general similarities. For example, Peer et al. (2017) find that samples recruited via ProA are similar to MTurk samples in age and education. Posch et al. (2019) find samples of Americans obtained via CF are also similar to MTurk: both had a high proportion of women, as well as people who are unemployed, unmarried, young, and well-educated.7

7 Posch et al. (2019) used CF samples from 10 different countries, but these comparisons are not to country benchmarks.

Beyond the general demographic characteristics of crowdsourced samples, there is

another concern about recruitment. Even if a random sample of all of the crowdsourced participants on a given platform could somewhat reflect a general population sample recruited by a survey company, any single study may not necessarily produce a random sample of crowdsourced participants. Since crowdsourced participants can pick and choose which studies they take – or may simply log on to the crowdsourcing website at arbitrary points in time – the quality of a crowdsourced sample may vary from study to study (Chandler et al. 2014). Exploring the possibility that a crowdsourced sample may vary by the day of the week, the time of the day at which a study is posted, the “serial position” (i.e., how new the task is relative to other tasks posted), and payment, Casey et al. (2017) conducted a study in which the same MTurk survey was posted at three different points in the day for 56 days. Across all of the posted studies, they take 403 demographic measures, finding differences in 33 measures due to either day of the week, time of the day, or “serial position.” Although the day of the week on which the study is posted has smaller effects on the sample composition, the time of day at which the study was posted and the serial position have larger effects.8

8 The 33 affected comparisons were as follows. Affected by time of day: time zone of participants, worker experience, percentage completed by smartphone, relationship status, Human Intelligence Task (HIT) obtained on a forum (rather than MTurk platform), percentage Asian Americans, and worker conscientiousness. Affected by day of the week: HIT obtained on a forum (rather than MTurk platform), age, and employment status. Serial position effects: worker experience, emotional stability, age, conscientiousness, agreeableness, employment status, household size, sex, percentage Asian American, and HIT obtained on a forum (rather than MTurk platform). Pay effects: worker experience and emotional stability. See table 5 of Casey et al. (2017).

The demographic characteristics of crowdsourced samples – combined with the possibility that the time of the day alone could influence sample quality – may seem like a cause for alarm. Yet the extent of concern should depend on what Coppock and McClellan (2019) term a “fit-for-purpose framework” (p. 2). Certainly,


these demographic patterns suggest that crowdsourced samples should not be used for the purpose of descriptive inferences about a general population, but perhaps there is less reason for concern when the research goal is to consider treatment effects or relationships between variables (Coppock and McClellan 2019). Empirical studies find that experimental results across a variety of contexts generally replicate with crowdsourced samples (Coppock 2019; Mullinix et al. 2015). Even the ideological uniqueness of some crowdsourced samples may not mean that they are invalid sources of data for political scientists conducting experiments. Delving more deeply into the psychological underpinnings of belief systems, for example, Clifford et al. (2015) find that the platform’s ideological uniqueness does not render it invalid in studies of ideology. Even more reassuring is that Levay et al. (2016) find that MTurk workers do not vary from other US samples on some unmeasurable characteristics. In perhaps the most thorough approach to the question, Coppock et al. (2018) replicate, with MTurk participants, 27 different survey experiments (including studies from political science, psychology, sociology, and business) originally conducted on probability samples in the USA. In 25 of 27 experiments, they fail to reject the null of no difference in the results – meaning that the results obtained with the probability sample are quite similar to those obtained with MTurk. Key to replication – versus a failure to replicate – is the general homogeneity of the treatment effect and that “any effect heterogeneity is orthogonal to sample selection” (p. 12445). They conclude: “Our results indicate that even descriptively unrepresentative samples constructed with no design-based justification for generalizability still tend to produce useful estimates not just of the [sample average treatment effects] but also of subgroup [conditional average treatment effects] that generalize quite well” (p. 12445).9 9 There does not appear to be a clear pattern to the studies that Coppock et al. (2018) do not replicate successfully. One of the studies is in international law (international relations subfield); another deals with


political polarization in American politics (American politics, political behavior subfield). However, the experiment that Boas et al. (2020) cannot replicate successfully with convenience samples in their study deals with military interventions (again, international relations). It is possible that there are subfield differences underlying cases when crowdsourced samples produce different treatment effects than probability samples, but to date there is no systematic consideration of this possibility.

9.4.1.2 non-naivete of subjects

Beyond the demographics of the sample, however, there are other aspects of crowdsourced samples that may be concerning. People who are part of crowdsourcing platforms may be especially likely to have experience with experimental research – making them non-naive subjects. If experimental tasks often rely on naive participants (Vinacke 1954), experienced participants may respond differently to treatments (Edlund et al. 2009). There are some reasons to believe that crowdsourced samples may be less naive than other types of samples. First, participants are not limited in the number of studies they can take. As Ipeirotis (2010) suggests, some members of MTurk may consider participation in crowdsourced tasks – including academic studies – a form of employment. Comparatively, undergraduate participants are typically part of a subject pool for a more limited set of academic terms and take part in a smaller number of experiments (Chandler et al. 2014). Focusing directly on the frequency of participation, Chandler et al. (2014) rely on a set of 132 research tasks to consider MTurk member frequency of participation in surveys. Each of these tasks contained numerous participation opportunities, and if each of these participation opportunities was taken by a unique MTurk member, then these 132 posted tasks would have yielded 16,408 different participants. Yet, Chandler et al. (2014) find that these posted tasks were completed by only 7498 unique participants. Many of the participants in the studies were duplicate respondents: the “most prolific 10%,” they write, were responsible for 41% of the completed tasks (p. 114). Overall, Berinsky et al. (2012) also find this type of repetition across six different studies posted to MTurk:



24% of recruited MTurk participants took part in two or more studies of the possible six. This pattern is especially notable given more recent research suggesting fraudulent and “trolling” responding (Ahler et al. 2019). In sum, while crowdsourcing websites may boast large numbers of possible participants, not all members of crowdsourcing websites are equally likely to participate.10 This may be cause for concern because it may mean that some participants are more likely to be able to find research studies (which may be easier and pay more than other types of tasks) on crowdsourcing websites – increasing the concerns about self-selection (Chandler et al. 2013). While this is a concern, the motivating force in this self-selection appears to be the desire to make money more quickly, rather than a specific interest in academic research or a particular academic topic. This may still differentiate participants, but it may not be as detrimental as other opt-in motivations. Yet, even if one assumes that the self-selection concerns are somewhat minimal, the sheer act of taking numerous studies makes these crowdsourced subjects far from naive. While the potential non-naivete of crowdsourced participants may intuitively seem worrisome, empirical results on the matter are more mixed. Chandler et al. (2014) consider cognitive reflection tasks (CRTs) – alternating between tasks that have frequently appeared in studies posted on MTurk and those that have not appeared in the past. They find that prior experience does predict answers on the common CRTs, but the number of studies taken in the past does not predict answers to the novel CRTs. Critically, however, performance on all CRTs was correlated, suggesting that the measures were generally reliable. While prior participation can make participants more experienced in answering the same, repeated questions, this possibility does not doom the crowdsourcing platform.

10 Although a researcher can track the number of times that a particular MTurk member has taken one of their studies, it is difficult to track an MTurk member’s participation in studies posted by other researchers.

Moreover, if the frequency of participation is a concern, then it should be a concern that is not limited to crowdsourced samples. Given that more expensive companies that field online surveys (e.g., YouGov, GfK, AmeriSpeak) do rely on empaneled participants, the possibility of repeat participation remains present. Also, while Chandler et al. (2014) suggest that repeated participation can cause a learning effect on questions that appear frequently, repeated participation does not seem to teach participants about research as an enterprise. Notably, for example, Thomas and Clifford (2017) find that crowdsourced participants are just as likely to trust researchers and “buy in” to interactive experiments as participants in laboratory studies.
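Researchers who want to gauge how concentrated participation is in their own task logs, the pattern behind Chandler et al.'s (2014) "most prolific 10%" figure, can do so with a few lines of code. The sketch below assumes a hypothetical completion log with worker_id and study_id columns; the file and column names are illustrative stand-ins and will differ across platforms.

```python
import pandas as pd

# Hypothetical log of completed tasks: one row per completion, with an
# anonymized worker identifier and a study identifier.
log = pd.read_csv("completions.csv")

per_worker = log["worker_id"].value_counts()          # completions per worker, descending
top_decile = per_worker.head(max(1, len(per_worker) // 10))

print(f"unique workers: {len(per_worker)}, completions: {per_worker.sum()}")
print("share of completions by the most prolific 10% of workers: "
      f"{top_decile.sum() / per_worker.sum():.1%}")

# Within a single study, repeated worker_ids flag duplicate participation
dupes = log[log.duplicated(subset=["worker_id", "study_id"], keep=False)]
print(f"rows involved in within-study duplicates: {len(dupes)}")
```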

9.4.1.3 response inconsistencies Recently, crowdsourced samples have led to a new set of worries. In a series of samples recruited via MTurk, researchers noted “nonsensical responses to open-ended questions, random answers to experimental manipulations, and suspicious responses to demographic questions” (Kennedy et al. 2019, pp. 1–2). Investigations by scholars suggest that the possible causes of these low-quality responses were attempts by international respondents to mask their locations and take part in American studies by using virtual private servers (VPSs) (Kennedy et al. 2019). While this type of low-quality responding seemed to spike in the summer of 2018 (Dennis et al. 2019), there is evidence to suggest that VPS-based responding has been part of MTurk data since at least 2013 (Kennedy et al. 2019). Comparing fraudulent responses to valid responses in an experiment, Kennedy et al. (2019) find that the presence of VPS responses did not bias treatment effects, though the treatment effects were larger among the valid respondents. Similarly, Ahler et al. (2019) find that fraudulent responding – be it through VPSs or respondents engaged in extreme satisficing – attenuated treatment effects in experiments by 9%. Ahler et al. (2019) warn of the possibility that up to a


quarter of the data obtained via MTurk may not be trustworthy. Still, some types of problematic respondents do leave traces that allow them to be identified and excluded from data analysis (Ahler et al. 2019; Kennedy et al. 2019) – a point we will return to in the next section when we consider best practices for using crowdsourced data. This is not to suggest that scholars should be unconcerned about fraudulent responding, but to note that the possibility of VPS responses alone should not dissuade researchers from relying on crowdsourced data (nor should it be used as a heuristic to reject research that relies on crowdsourced data). While scholars can identify “nonsensical” VPS responses, low-quality responses produced by inattentive respondents on crowdsourced platforms pose a greater challenge. As Ahler et al. (2019) estimate, 5–7% of participants in the samples they recruited via MTurk were engaged in “trolling” – offering deliberately insincere responses to be “provocative, inflammatory, or humorous” (Lopez and Hillygus 2018, p. 6) – or satisficing. To this extent, then, the possibility that crowdsourced participants are less likely to produce high-quality data – even when they are valid, non-VPS respondents – would be a cause for concern. While Ahler et al. (2019) demonstrate some evidence of low-quality data (“trolling”), others suggest a more positive perspective of crowdsourced participants. Focusing on over-time responding, Johnson and Ryan (2020) find high levels of consistency in MTurk participants’ responses over a period of months and even years. In turn, these responses correlate with outcome measures in much the same way that prior literature would predict. In measures of political knowledge, Clifford and Jerit (2016) find that MTurk respondents are less likely to cheat than participants from survey company samples. MTurk participants are just as likely to pass attention checks as other samples (lab and online) (Paolacci et al. 2010) and may even be more likely to pass attention checks than either student participants or those recruited


by survey companies (Berinsky et al. 2012; Thomas and Clifford 2017).11 Further, Lopez and Hillygus (2018) report levels of trolling in surveys that do not rely on crowdsourced samples similar to those reported in Ahler et al. (2019), suggesting that trolling may be a survey respondent issue, rather than a crowdsourced sample issue.

11 The survey companies in question are YouGov and Dynata (formerly Survey Sampling International).

9.4.2 Best Practices for the Use of Crowdsourced Convenience Samples

These potential pitfalls suggest that scholars should be more deliberate in setting up a series of checks for studies with crowdsourced samples. First, we note that scholars should consider whether the more peculiar demographic structure of crowdsourced samples may be particularly detrimental for their study. In particular, is the overrepresentation of certain types of people and underrepresentation of others likely to threaten generalizability when measuring treatment effects? Some attempts could be made to adjust demographic characteristics. Scholars may rely on screening questions to obtain very specific samples from crowdsourced websites (a sample of only Twitter users, a sample of only parents, a sample of only college graduates, etc.). In these cases, people who answer appropriately to the screening questions are included in the study, and those who do not are told that they cannot participate. Researchers, however, suggest that scholars should be careful with screeners: Chandler and Paolacci (2017) found that making screeners “blatant” (e.g., telling MTurk participants that only people with certain characteristics were eligible to participate) led participants to lie about having a particular characteristic in order to continue with the study. Second, research on participant non-naivete suggests that scholars who are fielding numerous studies should, perhaps, rely on different types of measures to consider participant characteristics. As Chandler et al. (2013) show, MTurk participants eventually learn how to handle certain measures



that they see frequently, suggesting that consistent reliance on “classic” measures could eventually prove problematic. That being said, Johnson and Ryan (2020) show that even if these learning effects do occur, the measures do not seem to lose validity. Scholars who are posting the same study multiple times, however, may face a distinct challenge in that they may obtain multiple responses from the same participants. This means that they may want either to post at different points in the day (Casey et al. 2017) or to take other steps on crowdsourcing platforms to limit repeated participation. Third, it seems important to include measures to check for VPS responding and more general forms of inattention (Berinsky et al. 2016, as well as discussion in Chapter 12 in this volume). Given scholarly interest, there are numerous ways to either include checks that can immediately exclude VPS responses (Winter et al. 2019) or check for VPS responding post-data collection (Kennedy et al. 2019). Further, attention checks allow for additional means of addressing inattentive – but human – respondents (Berinsky et al. 2016). To this end, Thomas and Clifford (2017) review numerous approaches to determining whether crowdsourced participants are inattentive or otherwise problematic. Moreover, Thomas and Clifford (2017) argue that the exclusion of respondents who fail checks is likely to improve the validity of analyses; this approach can be anticipated so that scholars who plan to do so ensure that these exclusions are not based on post-treatment attention checks (see Montgomery et al. 2018). Integrating these various points, Ahler et al. (2019) include a thorough list of practical approaches – from retaining IP addresses to adjusting compensation – for avoiding both fraudulent and highly inattentive crowdsourced participants.

9.4.3 The Ethics of Using Crowdsourced Data

When dealing with crowdsourced data, researchers engage with subjects directly –

there is no survey company or subject pool manager functioning as a go-between to ensure that all study participants are treated fairly. But in the quest for data quality, scholars may forget that crowdsourced platforms are composed of real people who should be treated ethically. Delving more deeply into the type of people who are members of a crowdsourcing platform, for example, Williamson (2016) discusses the types of life and family circumstances that lead these individuals to seek funds through participation in crowdsourced tasks. What this means is that it is up to the individual researcher to engage in the best ethical practices for dealing with crowdsourced participants (Buhrmester et al. 2018). We note that even if a practice is approved by an institutional review board, that approval does not necessarily render it ethical within the context of a particular crowdsourcing community. Steps such as unpaid screeners, for example, may be convenient for the researcher, but they may create tremendous difficulties for members of crowdsourced platforms. Because crowdsourced participants depend on these tasks to earn funds (Williamson 2016), it is unethical to ask participants to take part in uncompensated lengthy screening questionnaires, since they may spend a good deal of time only to learn that they do not qualify for a study. Similarly unethical is using a crowdsourcing platform’s block function to bar participants who have done nothing wrong for the sheer convenience of obtaining unique participants across studies. Barring participants affects their overall reputation scores, limiting their ability to participate in other tasks and earn money. Other methods of ensuring unique samples, while less convenient, may be more ethical. Broadly, scholars should engage in fair compensation for study participation. Williamson (2016) discusses how researchers should handle compensation, noting that – at the very least – they should meet minimum wage requirements. If researchers, as Williamson (2016) writes, are concerned that higher payments on certain studies may “bias the pool” of crowdsourced workers,


she suggests offering retroactive bonuses for participation (if the crowdsourced platform allows for this method) to increase compensation. Offering people more money for study completion does not seem to have much effect on data quality (Andersen and Lau 2018), but it is certainly the more ethical approach to using crowdsourced samples.
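Returning to the best practices in Section 9.4.2, the screening steps discussed there can be scripted so that exclusions are transparent and decided before outcomes are examined. The sketch below is deliberately simplified and uses hypothetical file and column names; the duplicate-IP flag is only a rough stand-in for the fuller VPS checks described by Kennedy et al. (2019), and, following Montgomery et al. (2018), respondents are dropped only on pre-treatment checks.

```python
import pandas as pd

# Hypothetical survey export; column names will differ across platforms.
df = pd.read_csv("raw_survey_data.csv")

# 1. Flag duplicate IP addresses, one of the simpler traces left by
#    VPS-based or duplicate respondents (a rough stand-in for fuller checks).
df["dup_ip"] = df.duplicated(subset="ip_address", keep=False)

# 2. Flag failures on a PRE-treatment attention check only; dropping on
#    post-treatment checks conditions on a post-treatment variable and can
#    bias the estimated treatment effect (Montgomery et al. 2018).
df["failed_pretreat"] = df["pretreat_check"] != "correct_option"

# Report how much each screen would remove before excluding anyone
print(df[["dup_ip", "failed_pretreat"]].mean())

analysis_sample = df[~df["dup_ip"] & ~df["failed_pretreat"]].copy()
print(f"retained {len(analysis_sample)} of {len(df)} respondents")
```

Reporting the share flagged by each screen, as in the print statements above, also makes it easier for readers and reviewers to see how much the exclusions matter.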

9.5 Other Types of Convenience Samples Undergraduate samples and crowdsourced samples do form the bulk of convenience samples used in recently published political science experiments. Still, given our definition, there are other types of samples that may be classified as convenience samples. While there is less systematic empirical work on these types of samples (at least compared to the research on undergraduate and crowdsourced samples), recent work suggests the extent to which these other types of convenience samples may affect the inferences scholars can make from these data. We turn first to sources of convenience samples that are not undergraduate students or crowdsourcing platforms. Second, we explore field and audit studies as experiments that are, by definition, convenience sample studies (for more on audit studies, see Chapter 3 in this volume). 9.5.1 Alternative Sources of In-Person Convenience Samples Scholars who conduct laboratory experiments may not always want to rely on undergraduate participants. A study conducted in the lab, however, is still geographically bound. To this end, researchers have turned to local residents and university employees as sources of data (Kam et al. 2007). These nonstudent participants can be more diverse than undergraduates on a variety of characteristics and, as Kam et al. (2007) find, those working on campus are, at the very least, a reasonable approximation of the local community. Clifford and Jerit (2016) further compare a convenience sample of university staff to other samples, including crowdsourced


samples and those recruited by survey firms. In studies conducted online, they find that samples of campus staff behave generally like those recruited via survey companies. Campus staff are, for example, more likely to cheat on knowledge measures than MTurk participants, but less likely to do so than undergraduate students. We note, however, that the convenience of these types of convenience samples may be more limited. Kam et al. (2007), for example, report a 24.3% response rate for contacted campus staff for a sample of n = 109, and an 11.9% response rate for local residents for a sample of n = 115. They also note that each of these participants was paid $30 for study completion. Clifford and Jerit (2016) paid their campus staff participants $25 for a sample of n = 81, relying on a campus-wide staff email seeking participants. If researchers often turn to convenience samples for lower-cost alternatives, then campus staff or local adult samples do not seem to provide that type of convenience.12

12 The total cost for the Kam et al. (2007) study appears to be $6720 and the total cost for the Clifford and Jerit (2016) study is $2025. Comparatively, in 2019, a study run by Dynata, a survey company that aims to recruit national samples, cost $3250 for about 800 participants.

Still, the use of these samples may be the only reasonable option for researchers who want to conduct studies in the lab but would prefer not to rely on an undergraduate sample (see Chapter 5 in this volume on laboratory studies conducted outside of campus settings). As Kam et al. (2007) write, there are certain characteristics on which the local and campus staff samples differ from the undergraduate students – but there are other politically relevant characteristics on which the samples (at least in their case) were strikingly similar. “This difference in the distributions for key covariates,” as Kam et al. (2007) write, “might or might not be consequential – it depends on the research question” (p. 428).

9.5.2 Alternative Sources of Online Convenience Samples

Crowdsourcing platforms are likely the easiest means of obtaining online convenience



samples, in large part because their infrastructure is designed with these types of research tasks in mind (indeed, ProA is deliberately designed for crowdsourcing research). Aside from crowdsourcing platforms, however, researchers can turn to other means of recruiting convenience samples online, such as social media websites and forums.13 Much of the research on non-crowdsourced convenience sample recruitment has focused on Facebook. In one of the most thorough investigations of the topic, Boas et al. (2020) compared users recruited via Facebook to MTurk, a non-probability sample, and a probability sample. They find Facebook users to be generally less cooperative than MTurk users.14 Moreover, there is some evidence that Facebook samples are demographically farther from probability samples than those recruited via MTurk. In particular, Boas et al. (2020) find that Facebook produced samples that were more politically knowledgeable and engaged than a probability sample (even to a higher degree than MTurk). They note that this may be due to Facebook’s algorithm, which deliberately targeted their study to people who may be more interested in politics. Still, relative to a benchmark study conducted with a survey company, Boas et al. (2020) were able to replicate two out of the three survey experiments (originally published in political science journals) using both MTurk and Facebook data. Even so, they caution about the generalizability of the treatment effects obtained with convenience samples and note that while the patterns of treatment effects replicate, the point estimates from the convenience samples were often much smaller than those from the benchmark samples. The possibility that the Facebook algorithm can, in some sense, bias the recruited

13 Undergraduate and campus staff participants could also form online samples – one may, for example, send invitations to these two groups with a URL to an online study. Still, what distinguishes the samples we discuss here is that interaction is only possible online. 14 Boas et al. (2020) consider both American and Indian samples.

sample by strategically directing surveys is something that other researchers echo. Kapp et al. (2013) note that Facebook studies rely on ads, which means that researchers must balance ads that seem enticing enough to ensure participation but not so enticing that they bias the sample. Perhaps the most effective use of the Facebook convenience sample, then, is Ryan (2012), who does not rely on the website to recruit a sample, but treats the very ads he posts for recruitment purposes as a manipulation in and of itself to study attention to political stimuli. In all, it is not immediately clear, given the research on the various platforms, why scholars would turn to Facebook for an online convenience sample rather than relying on a crowdsourcing platform. The Facebook-recruited participants have limitations that are similar to crowdsourced participants and, given Facebook algorithms, may be much more difficult to recruit without bias (although see Chapter 10 in this volume for various other uses of social media communities for experiments).
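The replication comparisons discussed here and in Section 9.4.1.1, in which Boas et al. (2020) and Coppock et al. (2018) benchmark convenience-sample results against probability or survey-company samples, come down to asking whether two versions of the same experiment yield detectably different treatment effects. A minimal version of that comparison is sketched below with simulated placeholder data; it is an illustration rather than the procedure used in those studies, and failing to reject the null here is not by itself evidence that the two estimates are equivalent.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def simulate(n, true_effect):
    """Placeholder data: a binary treatment and a continuous outcome."""
    treat = rng.integers(0, 2, n)
    y = rng.normal(0, 1, n) + true_effect * treat
    return y, treat

def ate_and_se(y, treat):
    """Difference in means and its standard error for one experiment."""
    y1, y0 = y[treat == 1], y[treat == 0]
    ate = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return ate, se

y_conv, t_conv = simulate(800, 0.30)     # convenience-sample version
y_bench, t_bench = simulate(1200, 0.40)  # benchmark-sample version

ate_c, se_c = ate_and_se(y_conv, t_conv)
ate_b, se_b = ate_and_se(y_bench, t_bench)

# z-test of the null that the two designs recover the same treatment effect
z = (ate_c - ate_b) / np.sqrt(se_c**2 + se_b**2)
p = 2 * stats.norm.sf(abs(z))
print(f"convenience ATE = {ate_c:.2f}, benchmark ATE = {ate_b:.2f}, p = {p:.2f}")
```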

9.6 Field Experiments and Audit Studies as Convenience Samples Field experiments and audit studies provide scholars with the opportunity to study behavior outside of the lab (or survey) context (Gerber 2010; Gerber and Green 2008). While these studies have the tremendous potential to overcome the limitations of both survey and laboratory experiments, as well as observational data, it is possible that one may consider the participants in many field and audit studies as convenience samples. If the definition of convenience samples is non-probability samples that are recruited for the ease of their study participation, then field experiments and audit studies could be defined as “convenience samples.” We note, however, that this is a gray area. Grose (2014), for example, writes that many field experiments are conducted on the New Hampshire state legislature because it is one of the largest state legislatures. Does a study that relies on


the New Hampshire legislature then, by definition, rely on a convenience sample? If the goal is to draw inferences only about the New Hampshire government, then the answer is likely no; but if the goal is to draw inferences about state legislatures in general, then an argument could be made that the use of one legislative body forms a convenience sample. Similar arguments can be made about field experiments on individual behavior. Often, these types of experiments rely on partnerships between researchers and an organization (Nickerson and Hyde 2016; see Chapter 11 in this volume), which means that the researcher is reliant on the type of people that an organization is most interested in reaching. In Krupnikov and Levine (2019), for example, the field experiment samples are women in a northeastern state because their partnering organization believed this group to be its likely donors. In Mann (2010), the field experiment participants are unmarried women, again due to the wishes of a partnering organization. Does this make the samples convenience samples? Definitionally, the answer may be yes, though we believe a reasonable argument could be made that samples recruited purely for researcher convenience and the samples in field experiments reflect very different goals. Using field experiments, scholars are attempting to assess the average effect of an intervention. Often, this is done by partnering with an organization that is already interested in studying a given intervention (see Chapter 11 in this volume on partnerships). In turn, the samples provided by partnering with organizations reflect the types of people who would have been most likely to have received the intervention in the first place, making the sample a convenience sample. On the other hand, when scholars are not partnering with organizations – for example, when they are conducting audit studies (see Chapter 3 in this volume) – the characterization of the sample may depend on the definition of the population. In other words, whether the sample is a convenience sample or a probability sample depends on the researcher’s goal regarding study generalizability.


9.7 Conclusions While probability samples are, generally, the gold standard for research on individual behavior in political science (Mutz 2011), convenience samples also have an important role in experimental research. Studies that, by necessity, must be conducted in the laboratory – for example, studies that require neuroimaging (e.g., Nam et al. 2018), or psychophysiological equipment (Soroka and McAdams 2015) or studies of in-person social interactions (e.g., Klar 2014) – often must rely on convenience samples due to participant travel constraints. In other cases, scholars turn to convenience samples due to cost constraints: an experiment with undergraduates or crowdsourced participants can be conducted at a lower cost than one using participants recruited by a survey company. In sum, convenience samples provide a source of data for scholars who otherwise would be unable to conduct experimental research in political science. There are, of course, concerns about the use of convenience samples. Convenience samples may have less generalizability when scholars aim to make inferences about the population; these samples may not only be demographically different from a random sample of the population, but may also overrepresent the types of people who have (at times unmeasurable) qualities that lead them to be overresponsive or underresponsive to experimental stimuli. Scholars may also need to worry about data quality, especially if they are attempting to recruit online. Having noted these concerns, it is important to underscore that much of the empirical research on the use of convenience samples suggests that the results obtained using these samples often replicate the results obtained with probability samples. Certainly, they should not be used for descriptive inferences about some target population, but convenience samples can provide useful evidence about experimental treatment effects. Experiments conducted on convenience samples replicate effects obtained with “gold-standard” samples


(Coppock et al. 2018); even when samples seem to differ on politically important factors (e.g., ideology), convenience samples often prove reasonable (Clifford et al. 2017). In short, published papers on the use of MTurk and student subjects give little reason to treat reliance on convenience samples as a heuristic for rejecting research. We note, however, that the pattern of published papers on this topic may reflect a bias wherein papers demonstrating successful replications of experiments with convenience samples are more likely to pass peer review (see Chapter 19 in this volume on publication bias). None of this means that convenience samples should be used without caution. Indeed, even scholars who replicate experiments with convenience samples suggest that these samples be used with care (Boas et al. 2020; Coppock 2019). It is critical to consider, then, whether one's sample differs on covariates that are pivotal to the theoretical premises being tested (e.g., Druckman and Kam 2011). This is a question every scholar should address before conducting experimental research – the recruitment method and cost of a sample cannot, on their own, signal its usefulness for the research at hand. Moreover, an expensive sample purchased from a well-known survey company may be less generalizable than a researcher hopes due to declining response rates (Leeper 2019). Rather than depending on cost or recruitment method, the relative quality of a sample is particular to the research questions and hypotheses; this is not only something that researchers should consider in their own work, but also something to keep in mind when evaluating the work of others.
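As a concrete illustration of the covariate check described above, the following sketch compares a convenience sample's distribution on theoretically pivotal moderators against benchmark values supplied by the researcher (for instance, estimates from a probability sample). The data, variable names, and benchmark figures are hypothetical placeholders, not taken from any study discussed here.

import pandas as pd

# Hypothetical convenience sample with covariates thought to moderate the treatment effect
sample = pd.DataFrame({
    "age": [22, 34, 29, 41, 23, 56],
    "ideology_7pt": [3, 4, 2, 5, 3, 6],        # 1 = very liberal, 7 = very conservative
    "political_interest": [4, 3, 4, 2, 4, 3],  # 1 = not at all interested, 4 = very interested
})

# Hypothetical benchmark means (e.g., from a probability sample or census data)
benchmarks = {"age": 47.5, "ideology_7pt": 4.1, "political_interest": 2.6}

def covariate_report(df, targets):
    # Compare sample means to benchmark means, scaling the gap by the sample SD
    rows = []
    for var, target in targets.items():
        mean, sd = df[var].mean(), df[var].std()
        rows.append({
            "covariate": var,
            "sample_mean": round(mean, 2),
            "benchmark": target,
            "std_difference": round((mean - target) / sd, 2),
        })
    return pd.DataFrame(rows)

print(covariate_report(sample, benchmarks))

A report of this kind does not license descriptive inference; it simply flags moderators on which the convenience sample departs sharply from the target population, which is the kind of question Druckman and Kam (2011) urge researchers to ask before fielding a study.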

References

Ahler, Douglas, Carolyn Roush, and Gaurav Sood. 2019. "The Micro-Task Market for Lemons: Data Quality on Amazon's Mechanical Turk." Working paper. URL: www.gsood.com/research/papers/turk.pdf

Alwin, Duane F., and Jon A. Krosnick. 1991. “Aging, Cohorts, and the Stability of Sociopolitical Orientations over the Life Span.” American Journal of Sociology 97(1): 169–195. Andersen, David, and Richard Lau. 2018. “Pay Rates and Subject Performance in Social Science Experiments Using Crowdsourced Online Samples.” Journal of Experimental Political Science 5(3): 217–229. Berinsky, Adam J., Gregory A. Huber, and Gabriel S. Lenz. 2012. “Evaluating Online Labor Markets for Experimental Research: Amazon.Com’s Mechanical Turk.” Political Analysis 20(3): 351–368. Berinsky, Adam J., Michelle Margolis, and Michael Sances. 2014. “Separating the Shirkers from the Workers? Making Sure Respondents Pay Attention on Self-Administered Surveys.” American Journal of Political Science 58: 739–753. Boas, Taylor, Dino Christenson, and David Glick. 2020. “Recruiting Large Online Samples in the United States and India: Facebook, Mechanical Turk, and Qualtrics.” Political Science Research and Methods 8(2): 232–250. Buhrmester, Michael D., Sanaz Talaifar, and Samuel D. Gosling. 2018. “An Evaluation of Amazon’s Mechanical Turk, Its Rapid Rise, and Its Effective Use.” Perspectives on Psychological Science 13(2): 149–154. Casey, Logan. S., Jesse Chandler, Adam S Levine, Andrew Proctor, and Dara Strolovitch. 2017. “Intertemporal Differences Among MTurk Workers: Time-Based Sample Variations and Implications for Online Data Collection.” SAGE Open. DOI: 10.1177/2158244017712774. Chandler, Jesse, and Gabriele Paolacci. 2017. “Lie for a Dime: Most Pre-Screening Responses are Honest but Most Participants are Imposters.” Social Psychological and Personality Science 8(5): 500–508. Chandler, Jesse, Pam Mueller, and Gabriele Paolacci. 2014. “Nonnaive among Amazon Mechanical Turk Workers: Consequences and Solutions for Behavioral Researchers.” Behavioral Research Methods Science 46: 112–130. Clifford, Scott, and Jennifer Jerit. 2014. “Is There a Cost to Convenience? An Experimental Comparison of Data Quality in Laboratory and Online Studies.” Journal of Experimental Political Science, 1(2): 120–131. Clifford, Scott, Ryan Jewell, and Philip D. Waggoner 2015. “Are Samples Drawn from

Mechanical Turk Valid for Research on Political Ideology?" Research & Politics. DOI: 10.1177/2053168015622072. Coppock, Alexander. 2019. "Generalizing from Survey Experiments Conducted on Mechanical Turk: A Replication Approach." Political Science Research and Methods 7(3): 613–628. Coppock, Alexander, Thomas J. Leeper, and Kevin J. Mullinix. 2018. "Generalizability of Heterogeneous Treatment Effect Estimates Across Samples." Proceedings of the National Academy of Sciences of the United States of America 115(49): 12441–12446. Coppock, Alexander, and Oliver McClellan. 2019. "Validating the Demographic, Political, Psychological, and Experimental Results Obtained from a New Source of Online Survey Respondents." Research and Politics. DOI: 10.1177/2053168018822174. Corrigan, Patrick W., Andrea B. Bink, J. Konadu Fokuo, and Annie Schmidt. 2015. "The Public Stigma of Mental Illness Means a Difference Between You and Me." Psychiatry Research 226: 186–191. Dennis, Sean A., Brian Matthew Goodson, and Chris Pearson. 2019. "Virtual Private Servers and the Limitations of IP-Based Screening Procedures: Lessons from the MTurk Quality Crisis of 2018." URL: https://ssrn.com/abstract=3233954 Druckman, James N., and Cindy D. Kam. 2011. "Students as Experimental Participants: A Defense of the 'Narrow Data Base.'" In Cambridge Handbook of Experimental Political Science, eds. James N. Druckman, Donald P. Green, James H. Kuklinski, and Arthur Lupia. Cambridge, UK: Cambridge University Press, pp. 41–57. Edlund, John E., Brad J. Sagarin, John J. Skowronski, Sara J. Johnson, and Joseph Kutter. 2009. "Whatever Happens in the Laboratory Stays in the Laboratory: The Prevalence and Prevention of Participant Crosstalk." Personality and Social Psychology Bulletin 35: 635–642. Falk, Armin, Stephan Meier, and Christian Zehnder. 2013. "Do Lab Experiments Misrepresent Social Preferences? The Case of Self-Selected Student Samples." Journal of the European Economic Association 11(4): 839–852. Ford, John B. 2017. "Amazon's Mechanical Turk: A Comment." Journal of Advertising 46(1): 156–158. Franco, Annie, Neil Malhotra, Gabor Simonovits, and L. J. Zigerell. 2017. "Developing Standards for Post-Hoc Weighting in Population-Based


Survey Experiments” Journal of Experimental Political Science 4(2), 161–172. Gelman, Andrew. 2013. “Don’t Trust the Turk” Monkey Cage Blog. URL: http:// themonkeycage.org/2013/07/dont-trustthe-turk/ Gerber, Alan. 2011. “Field Experiments in Political Science.” In Cambridge Handbook of Experimental Political Science, eds. James N. Druckman, Donald P. Green, James H. Kuklinski, and Arthur Lupia. Cambridge, UK: Cambridge University Press, pp. 115–138. Gerber, Alan, and Donald P. Green. 2008. “Field Experiments and Natural Experiments.” Oxford Handbook of Political Methodology, eds. Janet M. Box-Steffenmeier, Henry E. Brady, and David Collier. Oxford: Oxford University Press, pp. 357–382. Grose, Christian R. 2014. “Field Experimental Work on Political Institutions.” Annual Review of Political Science 17(1): 355–370. Harms, P. D., and Justin A. DeSimone. 2015. “Caution! MTurk Workers Ahead – Fines Doubled.” Industrial and Organizational Psychology 8(2): 183–190. Henry, P. J. 2008. “College Sophomores in the Laboratory Redux: Influences of a Narrow Data Base on Social Psychology’s View of the Nature of Prejudice.” Psychological Inquiry 19(2): 49–71. Highton, Benjamin. 2009. “Revisiting the Relationship between Educational Attainment and Political Sophistication.” Journal of Politics 71(4): 1564–1576. Hovland, Carl. 1959. “Reconciling Conflicting Results Derived from Experimental and Survey Studies of Attitude Change.” American Psychologist 14(1): 8–17. Huff, Connor, and Dustin Tingley. 2015. “‘Who Are These People?’ Evaluating the Demographic Characteristics and Political Preferences of MTurk Survey Respondents.” Research and Politics. DOI: 10.1177/2053168015604648. Hyde, Susan, and David Nickerson. 2016. “Conducting Research with NGOs: Relevant Counterfactuals from the Perspective of Subjects.” In Ethics and Experiments: Problems and Solutions for Social Scientists and Policy Professionals, ed. Scott Desposato. Abingdon: Routledge, pp. 198–216. Ipeirotis, Panagiotis G. 2010. “Demographics of Mechanical Turk.” NYU Working Paper No. CEDER-10-01. URL: https://ssrn.com/ abstract=1585030 Iyengar, Shanto. 2011. “Laboratory Experiments in Political Science Cambridge Handbook of


Experimental Political Science.” In Cambridge Handbook of Experimental Political Science, eds. James N. Druckman, Donald P. Green, James H. Kuklinski, and Arthur Lupia. Cambridge, UK: Cambridge University Press, pp. 73–88. Jennings, M. Kent, and Gregory B. Markus. 1984. “Partisan Orientations over the Long Haul: Results from the Three-Wave Political Socialization Panel Study.” American Political Science Review 78(4): 1000–1018. Jennings, M. Kent, and Richard G. Niemi. 1981. Generations and Politics: A Panel Study of Young Adults and Their Parents. Princeton, NJ: Princeton University Press. Johnson, David Blake, and John Barry Ryan. 2020. “Amazon Mechanical Turk Workers Can Provide Consistent and Economically Meaningful Data.” Southern Economic Journal. DOI: 10.1002/soej.12451. Kahan, Daniel. 2013. “Fooled Twice, Shame on Who? Problems with Mechanical Turk Study Samples, Part 2.” Cultural Cognition Blog. URL: www. culturalcognition. net /blog /2013 /7 /10 / fooled -twice -shame -on-who-problems-withmechanical-turk-stud.html Kam, Cindy D., Jennifer Wilking, and Elizabeth Zechmeister. 2007. “Beyond the “‘Narrow Data Base’: Another Convenience Sample for Experimental Research.” Political Behavior 29: 415. Kennedy, Ryan, Scott Clifford, Tyler, Burleigh, Ryan Jewell, Philip Waggoner, and Nicholas Winter. 2019. “The Shape of and Solutions to the MTurk Quality Crisis.” URL: https://ssrn .com/abstract=3272468 Klar, Samara. 2014. “Partisanship in a Social Setting.” American Journal of Political Science 58: 687–704. Krupnikov, Yanna, and Adam Seth Levine. 2019. “Political Issues, Evidence, and Citizen Engagement: The Case of Unequal Access to Affordable Health Care.” Journal of Politics 81(2): 385–398. Leeper, Thomas. 2019. “Where Have the Respondents Gone? Perhaps We Ate Them All.” Public Opinion Quarterly 83: 280–288. Levay, Kevin. E., Jeremy Freese, and James N Druckman. 2016. “The Demographic and Political Composition of Mechanical Turk Samples.” SAGE Open. DOI: 10.1177/2158244016636433. Lopez, Jesse, and Sunshine D. Hillygus. 2018. “Why So Serious?: Survey Trolls and Misinformation.” Working paper. URL: https://ssrn .com/abstract=3131087

Lupton, Danielle L. 2018. "The External Validity of College Student Subject Pools in Experimental Research: A Cross-Sample Comparison of Treatment Effect Heterogeneity." Political Analysis 27: 90–97. Mann, Christopher B. 2010. "Is There Backlash to Social Pressure? A Large-Scale Field Experiment on Voter Mobilization." Political Behavior 32: 387. Mariani, Mack D., and Gordon J. Hewitt. 2008. "Indoctrination U.? Faculty Ideology and Changes in Student Political Orientations." PS: Political Science & Politics 41(4): 773–783. Mullinix, Kevin J., Thomas J. Leeper, James N. Druckman, and Jeremy Freese. 2015. "The Generalizability of Survey Experiments." Journal of Experimental Political Science 2(2): 109–138. Mutz, Diana. 2011. Population-Based Survey Experiments. Princeton, NJ: Princeton University Press. Nam, H. Hannah, John T. Jost, Lisa Kaggen, Daniel Campbell-Meiklejohn, and Jay J. Van Bavel. 2018. "Amygdala Structure and the Tendency to Regard the Social System as Legitimate and Desirable." Nature Human Behaviour 2: 133–138. Osborne, Danny, David O. Sears, and Nicholas A. Valentino. 2011. "The End of the Solidly Democratic South: The Impressionable-Years Hypothesis." Political Psychology 32(1): 81–108. Palan, Stefan, and Christian Schitter. 2018. "Prolific.ac – A Subject Pool for Online Experiments." Journal of Behavioral and Experimental Finance 17: 22–27. Peer, Eyal, Laura Brandimarte, Sonam Samat, and Alessandro Acquisti. 2017. "Beyond the Turk: Alternative Platforms for Crowdsourcing Behavioral Research." Journal of Experimental Social Psychology 70: 153–163. Posch, Lisa, Arnim Bleier, Fabian Flöck, and Markus Strohmaier. 2018. "Characterizing the Global Crowd Workforce: A Cross-Country Comparison of Crowdworker Demographics." arXiv:1812.05948. Prior, Markus. 2010. "You've Either Got It or You Don't? The Stability of Political Interest over the Life Cycle." Journal of Politics 72(3): 747–766. Prior, Markus. 2019. Hooked: How Politics Captures People's Interest. Cambridge, UK: Cambridge University Press. Russo, Silvia, and Håkan Stattin. 2017. "Stability and Change in Youths' Political Interest." Social Indicators Research 132(2): 643–658.

Ryan, Timothy J. 2012. "What Makes Us Click? Demonstrating Incentives for Angry Discourse with Digital-Age Field Experiments." Journal of Politics 74(4): 1138–1152. Sears, David O. 1986. "College Sophomores in the Laboratory: Influences of a Narrow Data Base on Social Psychology's View of Human Nature." Journal of Personality and Social Psychology 51(3): 515–530. Sears, David O., and Nicholas A. Valentino. 1997. "Politics Matters: Political Events as Catalysts for Preadult Socialization." American Political Science Review 91(1): 45–65. Shapiro, Danielle N., Jesse Chandler, and Pam A. Mueller. 2013. "Using Mechanical Turk to Study Clinical Populations." Clinical Psychological Science 1(2): 213–220. Soroka, Stuart, and Stephen McAdams. 2015. "News, Politics, and Negativity." Political Communication 32(1): 1–22. Strange, Austin, Ryan Enos, Mark Hill, and Amy Lakeman. 2018. "Intrinsic Motivation at Scale: Online Volunteer Laboratories for Social Science Research." Working paper.


URL: https://scholar.harvard.edu/files/renos/ files/strangeenoshilllakeman.pdf Thomas, Kyle A., and Scott Clifford. 2017. “Validity and Mechanical Turk: An Assessment of Exclusion Methods and Interactive Experiments.” Computers in Human Behavior 77: 184–197. Vinacke, W. Edgar. 1954. “Deceiving Experimental Subjects.” American Psychologist 9(4): 155. Williamson, Vanessa. 2016. “On the Ethics of Crowdsourced Research.” PS: Political Science & Politics 49(1): 77–81. Winter, Nicholas, Tyler Burleigh, Ryan Kennedy, and Scott Clifford. 2019. “A Simplified Protocol to Screen Out VPS and International Respondents Using Qualtrics.” Working paper. URL: https://ssrn.com/abstract=3327274 Yeager, David S., Jon A. Krosnick, Penny S. Visser, Allyson L. Holbrook, and Alex M. Tahk. 2019. “Moderation of Classic social Psychological Effects by Demographics in the U.S. Adult Population: New Opportunities for Theoretical Advancement.” Journal of Personality and Social Psychology 117(6): e84–e99.

CHAPTER 10

Experiments Using Social Media Data∗

Andrew M. Guess

Abstract

Widespread use of social media platforms has generated an explosion of data available for use by political scientists. This chapter will outline the possibilities of social media data for experimental research in all domains. At a basic level, social media data can be useful for improving measurement and design in the study of classic theories. They also facilitate research into questions about politics and the Internet itself. Using a large Twitter field experiment as a running example, I will illustrate how social media platforms can be used to (1) recruit experimental subjects, (2) deliver treatments, and (3) collect outcomes. I suggest that these possibilities are especially promising for scholars interested in studying political mobilization and media effects. Finally, I discuss challenges and opportunities for using these techniques to explore peer effects and other network dynamics.

* I thank the editors of this volume, James N. Druckman and Donald P. Green, for their tremendously helpful guidance and support throughout the development of this chapter. Jen Pan and Alex Coppock generously provided thoughtful feedback on an earlier draft, which also benefited from countless discussions with other authors in this volume.

Social media is an increasingly central arena for political communication, civic activism, fundraising, and other topics of interest to political scientists (Karpf 2013; Mutz and Young 2011). Beyond these well-established domains, new forms of activity on social media platforms are creating categories


of expression and behavior that are only beginning to be understood as politically relevant (Settle 2018). Naturally, then, it is increasingly necessary to incorporate data collected on social platforms into studies designed to understand these processes (Steinert-Threlkeld 2017). At the same time, passively generated “digital traces” on social media can serve as improved measures of latent concepts that are useful for studying long-standing questions in political science. How do elected officials communicate with their constituents? Who sets the policy agenda (Barberá et al. 2019)?


How can we place preferences of elites and the mass public on the same scale (Barberá 2015)? This chapter will focus on the potential uses of social media data in experiments for both sets of goals. Ultimately, as these kinds of data become more widely used, they may come to serve as a bridge by shedding light on the emerging relationship between online and offline behavior. This chapter is organized by the specific function social media data can have in experimental research designs: as a way to recruit subjects; to deliver treatments to those subjects; to collect survey or behavioral outcomes; and, finally, to study network dynamics such as peer effects. These capabilities create myriad possibilities for using social media as a mechanism for generating experimental data, whether or not the substantive question of interest is specific to social media. Finally, I discuss debates about whether it is ethical to conduct social media experiments and offer a specific recommendation for both researchers and platforms to help ensure a credible research agenda that does not require conducting on-platform experiments at all.

10.1 Subject Recruitment on Social Media One of the first design decisions for any experiment is to define the population and a procedure for recruiting a sample of participants. Several factors will inform this decision, ranging from the practical (cost, constraints from partner organizations) to the substantive (what is the target population for a given intervention?). These considerations carry over into the online world as well. As an illustrative example, consider an online field experiment conducted in the summer of 2016 (Guess et al. n.d.). The researchers partnered with a progressive advocacy group on a Twitter campaign designed to pressure Senate Republicans to hold a confirmation vote for Merrick Garland, President Obama’s Supreme Court


pick. Messages were sent to users who follow the organization’s Twitter account via Promoted Tweets (PTs), direct messages (DMs), or a combination of the two.1 All treatment messages urged recipients to follow a link to sign an online petition. The researchers collected several outcomes: petition signatures, tweets, and retweets. Given the partner organization’s focus on people likely to support their campaign and active on social media, defining the sample in this way was relevant to the research question. It was also convenient, since it is straightforward via the Twitter application programming interface (API)2 to scrape the followers of any public Twitter account. The researchers were able to do this within a reasonable amount of time given API rate limits. After excluding known journalists, large organizations with 20,000 or more followers, and others in the media industry, the sample was fixed at n = 13,559 before random assignment occurred.3 This experiment uses a particular kind of purposive sampling that is readily available to researchers designing studies using Twitter (e.g., Coppock et al. 2016). In the present case, the choice of subject pool was related to the interests of the partner organization. More generally, researchers may be interested in a particular subpopulation – for example, people who post about a political topic (see Chapter 9 in this volume). On Twitter, this is straightforward to do by searching for keywords or hashtags in the text of users’ tweets. In a particularly vivid example, Munger (2017) selects experimental subjects who engage in racist harassment 1 Using an auction-like mechanism, organizations or users can pay for exposures to PTs, which are the same as regular tweets but can be targeted to specific kinds of people and are displayed with a small “Promoted” label. For more information, see: https://business.twitter.com/en/help/overview/ what-are-promoted-tweets.html. 2 APIs provide a standardized framework allowing users and client-side apps to send server requests for data or other output. These interfaces have enabled large-scale, automated integration with digital platforms, survey providers, and other data services. 3 We excluded these accounts to ensure that our sample was composed to the greatest extent possible of typical politically engaged users.


as targets for social norm treatments from automated Twitter bots. Relatedly, Klar and Leeper (2019) have argued for the use of purposive sampling in experiments with intersectional identity groups. Focusing on subgroups or otherwise “difficult-to-reach” populations illustrates the particular advantages of Twitter, with its publicly observable activity and accessible search features. It also avoids the common pitfall of associating the general Twitter population with the population as a whole. The Pew Research Center reported that approximately 22% of American adults ever use Twitter (compared to 69% for Facebook and 73% for YouTube); in addition to overrepresentation of certain groups on the platform (those aged 18–29, men, those with a college education, liberals, African Americans, and Hispanics), there is a strong skew in who posts tweets: for example, those aged 65 or older are a small fraction of users but produce a third of tweets about national politics (Pew 2019a, 2019b). For some applications, however, researchers may be interested in sampling from the overall user base of a social media platform and can adapt their strategies accordingly. On Twitter, for instance, it is possible to approximate a random sample of the user population by randomizing account ID numbers (Barberá 2016). Zhang et al. (2020) demonstrate and validate a method for quota sampling using Facebook’s ads platform. This can be useful for researchers interested in running population experiments on a quasirepresentative Facebook sample. Some quasi-random sampling approaches were developed for survey-based research and could be adapted for experiments conducted on a social media population. Vaccari et al. (2015) identify Twitter users posting about politics in Italy and send them tweets with a link to an online survey. Rather than begin with Twitter data, another approach is to take an existing sample and link responses to Twitter accounts. Studies have started with voter file data, which were then linked to Twitter accounts via exact matching (Barberá 2015; Grinberg et al. 2019). This produces a sample that, conditional on known biases related to

the uniqueness of names, has representative characteristics and links together online and offline activity. Relatedly, it may be desirable to start with a survey whose respondents are sampled using a known procedure. In the running example introduced above, for instance, the researchers were able to link online petition signatures to tweet behavior by asking respondents to provide their Twitter usernames. Respondents may be willing to share this information with researchers: 61% of respondents to an unrelated US panel survey who said they have a Twitter account shared their Twitter handles, and 45% of respondents who said they have a Facebook account were willing to share profile information (Guess et al. 2019b). And while this kind of data collection will induce some sample-selection bias, there is reason to believe that rough correspondence to population benchmarks on observable demographics can be achieved (Vaccari et al. 2013). This general approach can be extended so that subjects with public social media accounts can first be recruited via representative surveys for later incorporation into experimental designs (see Chapter 4 in this volume). For instance, Bail et al. (2018) surveyed a sample of self-described frequent Twitter users and asked respondents to provide their usernames. This sample was then randomized into an online field experiment. Starting with a survey in this way has several advantages. First, depending on the sampling method, it can achieve a degree of representativeness. Second, researchers have access to standard individual-level covariates at baseline that would not readily be available when sampling directly from a platform. Third, as in the Bail et al. (2018) study, survey-based outcomes may be of interest to researchers. When this is the case, as Chapter 4 in this volume shows, there are substantial gains in statistical power, as well as the ability to mask researcher intent with “as-if-unrelated” surveys. When starting with a survey is not feasible or desirable, there are still methods available for imputing individual-level characteristics


using publicly available social media data. A promising way to address this problem is to use supervised learning methods to predict the demographic and political characteristics of social media users using account and network features (e.g., Barberá 2016). Finally, recruiting experimental subjects on social media may serve as a useful substitute for standard respondent pools. As an example, it is possible that some research questions necessitate the inclusion of subjects across a range of characteristics, such as digital literacy, which may be of theoretical interest. This would rule out commonly used respondent pools such as Mechanical Turk, which require a baseline level of digital literacy to join. Munger et al. (n.d.) thus argue that using the Facebook Ads platform is critical to drawing a sufficient number of respondents at lower levels of digital literacy. There may be other characteristics of subjects that would make recruitment via Facebook or Twitter desirable for reasons not directly related to their activity on those platforms.
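To make the sample-construction logic of the running example concrete, the sketch below shows one way a scraped follower list might be filtered and then randomly assigned to experimental arms. It is a stylized illustration under stated assumptions – the data frame, exclusion flag, and thresholds are hypothetical stand-ins rather than the original study's code; in practice the follower records would first be pulled from the Twitter API, subject to rate limits.

import numpy as np
import pandas as pd

rng = np.random.default_rng(2016)  # fixed seed so the assignment is reproducible

# Hypothetical follower records, assumed to have been scraped from the Twitter API earlier
followers = pd.DataFrame({
    "user_id": np.arange(1, 20001),
    "followers_count": rng.integers(0, 50000, size=20000),
    "is_media_account": rng.random(20000) < 0.03,  # placeholder flag for journalists/large organizations
})

# Exclusions analogous to those described in the text: media accounts and
# accounts with 20,000 or more followers are dropped from the sampling frame
frame = followers[
    (~followers["is_media_account"]) & (followers["followers_count"] < 20000)
].copy()

# Complete random assignment to four arms: control, Promoted Tweet, DM, or both
arms = np.array(["control", "promoted_tweet", "direct_message", "both"])
frame["condition"] = rng.permutation(np.resize(arms, len(frame)))

print(frame["condition"].value_counts())

Fixing the random seed makes the assignment reproducible, and the permutation-based assignment keeps the four arms close to equally sized.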

10.2 Delivering Treatments on Social Media Social media platforms offer a number of ways to deliver naturalistic treatments. In the case of Twitter, specific affordances and the relatively open nature of the platform create unique opportunities for experimental control. On Facebook, some researchers have cleverly leveraged their own social networks to create online field experiments (Haenschen 2016; Teresi and Michelson 2015). Aside from user-centric features, the advertising capabilities of both services also provide potential avenues for treating subjects. But perhaps the most straightforward approach is to apply a standard encouragement design to the social media context. This is vividly demonstrated by Bail et al. (2018), who randomly incentivize subjects recruited via survey to follow one of two Twitter accounts programmed to retweet posts by politically influential users. Subjects were


periodically quizzed about the contents of their Twitter feeds and surveyed again to gauge the effect of exposure to counterattitudinal social media content. The authors find a polarizing effect, concentrated mainly among Republican subjects exposed to left-leaning tweets who became more conservative on issue positions. By inducing people to follow accounts that they control, the researchers provide a proof of concept for future work examining the causal effects of exposure to different types of content on social media. But this elegant setup arguably comes at a cost in terms of the lengths to which the authors must go in order to ensure subjects’ attentiveness. Respondents assigned to the treatment group were asked to take surveys each week testing their familiarity with items posted by the researcher-controlled bots, including animal pictures retweeted by the accounts twice a day. Correct answers were rewarded with incentives. While this provides useful measures of treatment compliance, it is more obtrusive than typical approaches. This illustrates a trade-off inherent in research – in general, but especially on social media – between naturalism and strength of treatment. Like the offline world, online environments are crowded and multifaceted, with many competing demands on users’ attention. Thus, it may not be surprising that click-through rates on online advertisements are low or that a strong predictor of clicking a Facebook post is how highly ranked it is on a user’s News Feed (Bakshy et al. 2015). For experimental researchers, this implies that, at least in an intent-to-treat world, manipulating a single post, ad impression, or account exposure may not in itself be expected to produce measurably large effects. On Twitter, exposure rates may be particularly low: research suggests that as few as 5% of all tweets in a user’s feed may ever be seen (Wang et al. 2016). Even the relatively strong inducements to attentiveness in the Bail et al. (2018) experiment produce a rate of “full” compliance – as measured by performance on knowledge quizzes – at about 16%. At the other extreme, research on “fake news” suggests that 80% of exposure to this


type of online misinformation was concentrated among 1% of US voters on Twitter (Grinberg et al. 2019). Nudging subjects to be more attentive may involve an external validity trade-off, but in helping to tune out the background noise of endlessly scrolling feeds, it helps to guarantee a stronger dose of treatment. 10.2.1 Treatments on Twitter There are at least three different ways to use Twitter’s user capabilities to design treatments. These map onto three building blocks of the environment: DMs, tweets, and user accounts. Fortunately for researchers, Twitter has a robust, open, relatively accessible API that can be used to automate tasks and precisely control the implementation of research (Steinert-Threlkeld 2017). With a basic script, for example, tweets can be posted at timed intervals or in response to specific events. The basic challenge in designing field experiments on the Twitter platform is precisely this openness, in which the ability to publicly observe tweets creates the risk that subjects assigned to the control group still become exposed to treatment stimuli either directly or indirectly (e.g., because someone else reshared it). (On interference, see Chapter 16 in this volume.) DMs, while not the most obvious vehicle for treating subjects in a politically relevant randomized controlled trial, do have the advantage of not being publicly observable. These essentially provide private messaging functionality within the Twitter platform, not dissimilar from SMS or email. An early Twitter experiment used this functionality in partnership with the League of Conservation Voters to DM links to an online petition to followers of the organization’s official Twitter account (Coppock et al. 2016). Although this is not a commonly used technique to encourage participation in online campaigns, the experiment shed light on the mechanisms by which such private messages may work – in particular, via notifications that function in a way similar to email appeals. Leveraging the “social” aspect of this social platform requires circumventing the

observability of public tweets. It is possible to effectively do this via Twitter’s ads functionality, specifically PTs. Like similar online advertising products on Facebook and Google, PTs facilitate the surfacing of content via an online auction mechanism. PTs must themselves be actual tweets posted by the account of the user promoting them. They are seamlessly displayed in users’ timelines on the Twitter mobile or web interface like a standard tweet, but with a small “Promoted” tag underneath. Since a core feature of PTs is the ability to target specific users or categories of users, these can be shown to specific, randomly partitioned groups.4 In the running example, the same messages were sent to followers of the organization’s Twitter account via PTs, DMs, or a combination of the two. All treatment messages urged recipients to follow a link to sign an online petition. Figure 10.1 displays the results. We see that PTs were not at all effective, but that DMs boosted petition signatures by nearly 3 percentage points on average. Subjects assigned to receive both the private and public versions of the organization’s appeal (third row) were not measurably more likely to sign the online petition than the ones who only received the private DMs – a finding that does not support expectations of repeated exposure or multi-mode effects. Why would the same message on the same platform be much more effective when sent via DM rather than PT? One possibility is that private communications to specific recipients are more effective than public, generic appeals. This is consistent with a personalization mechanism, which has been shown in the email context to be effective at mobilizing prospective professional group members (Druckman and Green 2013). Alternatively, while the PT was targeted to subjects in the PT and combined treatment groups, there is no guarantee that it was successfully shown to all of them. The nature of online ad-auction 4 These groups can be defined by interests or by simply uploading lists of Twitter accounts or email addresses obtained by whatever means.


Figure 10.1 Effect of exposure to Promoted Tweets, direct messages, or both on signing an online petition.

platforms such as Twitter’s is that campaigns boost the likelihood of exposure to a particular piece of content. This uncertainty comes on top of the already low exposure rates on the platform, as mentioned in the previous subsection. A final way of enlisting Twitter’s affordances in service of experimental research is to send tweets specifically directed at subjects. In principle, these treatments could be observed by subjects in the control group, creating the possibility of spillover effects (Gerber and Green 2012). If designed carefully, however, this risk can be minimized. As Bail et al. (2018) do in their study, experimental samples can be constructed to exclude subjects with direct network ties. Since, by default, conversations between users are not visible on one’s Twitter timeline unless all participants in the conversation are

being followed, this would make spillovers unlikely, although not impossible. In practice, the usefulness of targeting tweets at specific users has been demonstrated in the use of "bot" interventions to sanction norm violations. In the above-cited study on racist harassment, Munger (2017) identifies a set of users employing racial slurs on Twitter before randomly assigning them to receive tweets from bot accounts delivering anti-racist messages. In principle, this design has a lot of promise: it transforms Twitter into a kind of real-world experimental environment for political and psychological research. The technique has already been demonstrated in domains such as sectarianism in the Middle East (Siegel and Badaan 2018). But illustrating the pitfalls of relying on third-party platforms for research purposes, Twitter's increasing vigilance


against automated bots has made this type of design more difficult to implement – at least without substantially increased human involvement. 10.2.2 Treatments on Facebook Although Facebook lacks the open architecture of Twitter, its features can also be adapted for use in experimental research. Given political campaigns’ increasing use of Facebook, these designs may be especially useful for questions related to the effect of online advertisements and other types of content on political participation. Some recent studies have cleverly demonstrated how Facebook Groups can be used to boost the proportion of subjects’ News Feeds containing treatment messages. In one study, Feezell (2018) randomly assigned students to join one of two Facebook Groups controlled by the researcher. Groups have several advantages for the design of experiments. Their visibility to others outside of the Group can be restricted, thus minimizing spillover concerns. Researchers can also measure engagement with individual posts and observe the number of Group members who merely see a piece of content – a potentially useful feature for studies of the effects of online political media exposure. In the case of the Feezell (2018) study, subjects had completed a baseline survey before entering the experiment and were given a second wave designed to collect measures of the dependent variable: perceived issue importance. In the above example, Groups are used primarily because individual posts are visible in Group members’ News Feeds alongside the other content that they would otherwise have also seen, giving the study a high degree of ecological validity. For this reason, the treatment and control groups are given innocuous names that shield their primary purpose. But for other research questions, the effect of Groups themselves may be of interest. Foos et al. (2018) work with an activist group in Bulgaria to test the effectiveness of a Facebook Group-based environmental

campaign. By randomizing participants to either a Facebook condition or an email condition, they can study how a networked, social media approach may differ in effectiveness compared to a traditional, email-based campaign. Given the centrality of friends – undirected edges in the social graph – on the Facebook platform, researchers have also explored reviving the experimental tradition of employing confederates. Haenschen (2016) does this in conjunction with Facebook’s Friend List capability, which allows users to include or exclude individual friends from the ability to see certain posts. The researcher samples users from the voter file, matches them to the friends of the confederates, and randomly assigns subjects to receive socialpressure get-out-the-vote messages that contain elements of shame or pride. From the perspective of the experimental subjects, the treatment could appear quite powerful: being specifically tagged by one of your Facebook friends in a status update, visible to others, with a message urging you to vote in an upcoming local election (in Texas, in this case). Accordingly, these treatments were quite powerful, with the “shame” message generating an estimated effect of more than 20 percentage points on turnout. Variants of this functionality promise to shed light on the effects of real-world interpersonal appeals on political participation. 10.2.3 Facebook Ads Given political campaigns’ growing use of Facebook advertisements, field studies on their effects – both offline and on – are likely to increase in importance. This is true for at least two reasons. First, the ability to target and randomly assign exposure to ads means avoiding many of the well-known difficulties of studying campaign effects in the offline world. As a result, researchers have a unique opportunity to build new theories of persuasion and mobilization whose usefulness could transcend the medium. Second, while most political ads are still shown on television, the online arena will


only continue to grow in importance. Aside from Facebook ads, political campaigns are transmitting messages via streaming video, Google ads, and even video games. Conducting experiments on the effects of online ads will help generate knowledge about the particular affordances, quirks, and drawbacks of online political communication. As in the offline world (Gerber et al. 2011), the most straightforward way of evaluating the impact of online political advertisements is to partner with campaigns themselves (see Chapter 11 in this volume). Such partnerships in the US context have yet to find their way into the academic literature, but successful field experiments have been conducted elsewhere. For instance, in Germany, Hager (2019) worked with a political party to randomize Berlin postal districts to be shown ads via both Facebook’s and Google’s platforms. In this setting, individual-level data are not available, but district-level vote shares are computed to measure outcomes. Unusually in this literature, the study found a wellestimated 0.5 percentage point increase in the party’s vote share. Fortunately, actual partnerships are not necessary to take advantage of the randomization capabilities embedded in the platform. A challenge is to leverage targeting features (such as geography, political affiliation, and age) while maintaining experimental control. Like Twitter, Facebook’s ad platform uses an auction-based system to set prices. Since it is not guaranteed that an ad will be successfully delivered to everyone in a given targeting category or delivered with some known probability of assignment, it may make sense to randomize at a higher level of aggregation. Ryan (2012) demonstrates how the demographic and geographic targeting features of Facebook Ads can be used to randomly assign clusters of users to see specific campaigns. Thus, treatment can be randomized over hundreds of mutually exclusive and exhaustive demographic subcategories. (See Ryan and Broockman 2012 for additional details.) This approach has been replicated and extended to research on numerous topics, such as selective exposure, the role of emotion


in political advertising, and effects on voting (Broockman and Green 2014; Collins et al. 2014; Ryan and Brader 2017). It is now possible to push these capabilities further by, for example, uploading voter lists to allow for individual-level randomization. Such an approach could be combined with baseline survey recruitment (see previous section) so that respondents can be randomized into a separate Facebook-based experiment with “as-if-unrelated” outcomes collected in a subsequent survey wave (again see Chapter 4 in this volume). In addition to making measurement as unobtrusive as possible, this design would also have improved power over the cluster-randomized variant described above. A final possible area for innovation is to push further on the opportunities afforded by Facebook’s targeting capabilities. Thus far, political scientists have used them to reconstruct the voting-age population of specific districts (Broockman and Green 2014); to focus on people with particular ideological affiliations (Ryan and Brader 2017); and to study non-ideologues (proxied by, for example, interest in music and sports) (Ryan and Brader 2017). As psychologists have demonstrated in work using Facebook ads, however, there are boundless possibilities for targeting people on the basis of political or nonpolitical preferences – for example, those who “like” specific pages (e.g., Matz et al. 2017). Given increasing levels of affective polarization in the mass public and arguments that social media has contributed to this rise (Settle 2018; Sunstein 2017), there are substantive reasons to further explore the relationship between lifestyle preferences and political outcomes. Research of this kind has become controversial for numerous reasons, so it is important to note the potential ethical implications for real-world research on social media employing built-in targeting techniques. Psychological research in this area has been directly implicated in debates over privacy and electoral manipulation, due in part to public concerns about personalitybased profiles of voters. I discuss some of the


ethical implications of this kind of research in Section 10.5.
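Returning to the cluster-randomized targeting strategy described above, the sketch below illustrates the basic logic: enumerate mutually exclusive and exhaustive targeting cells from an ad platform's demographic categories and randomize the ad condition at the cell level. The category labels and condition names are illustrative assumptions rather than the actual options offered by any platform, and a real implementation would need to confirm ad delivery within each cell.

import random
from itertools import product

random.seed(42)

# Hypothetical targeting dimensions offered by an ad platform
ages = ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"]
genders = ["women", "men"]
regions = ["Northeast", "Midwest", "South", "West"]

# Each combination defines one mutually exclusive, exhaustive targeting cell
cells = [{"age": a, "gender": g, "region": r} for a, g, r in product(ages, genders, regions)]

# Balanced random assignment of ad conditions at the cell level
conditions = ["emotional_appeal", "informational_appeal", "control"]
pool = (conditions * (len(cells) // len(conditions) + 1))[: len(cells)]
random.shuffle(pool)

assignment = [dict(cell, condition=arm) for cell, arm in zip(cells, pool)]
for row in assignment[:5]:  # preview the first few cells
    print(row)

Because treatment is assigned to cells rather than individuals, subsequent analyses would cluster standard errors (or aggregate outcomes) at the cell level.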

10.3 Collecting Outcomes on Social Media 10.3.1 On-Platform Engagement Social media enables a variety of approaches to collecting outcomes. In the running example, the researchers additionally used the Twitter API to scrape all tweets on the platform that included the promotional campaign’s web link. The campaign was designed so that this link was unique to the ads shown in the treatment. The main result is that, despite not boosting petition signatures, the PT treatments boosted the proportion of subjects who tweeted about (or retweeted) the campaign by roughly 2 percentage points. The effect of DMs was lower but still statistically distinguishable from zero. This is strong evidence that Twitter Ads can generate organic activity on the platform, whether through direct engagement or through amplification – retweeting the message to one’s own followers. Here, as in the other examples in this subsection, social media generates behavioral outcomes that can be collected via passive measurement strategies, a key feature that is often difficult to achieve in offline contexts. Another type of social media activity (i.e., on Twitter) that could be used as an experimental outcome is following (or unfollowing) behavior. When encountering a message or other content posted on the platform, subjects may choose to follow the associated account so that they can potentially see future posts from that user. Doing so could be a measure of interest or a measure of treatment compliance, for example. Over time, this kind of following/unfollowing behavior constitutes a rewiring of people’s social networks – itself a potential subject of interest. On Facebook and Twitter, it is possible to use click-throughs as experimental outcomes. The most straightforward form of interaction with a Facebook ad is to click through it, a metric easily accessible to the user who posted the campaign (Ryan 2012; Ryan and

Brader 2017).5 Other types of on-platform interactions can be used as outcomes: commenting behavior, reactions, and even views could be logged within the Facebook Groups created in the Feezell (2018) study. Of course, these types of “engagement metrics” may or may not translate to substantive behaviors of interest. (E.g., in Collins et al. 2014, Facebook “likes” of a mobilization ad campaign do not translate to actual votes.) A final – and mostly hypothetical – possibility for collecting Facebook behavioral outcomes (see Chapter 13 in this volume) is to ask users to voluntarily share specific profile data. In principle, this is still possible: users can give researchers limited permission to access data via Facebook’s Graph API.6 Here, again, ethical and privacy considerations may weigh more heavily than technological ones. If arrangements could be made to address these concerns, however, it may be feasible to gain these permissions on a temporary basis at the subject recruitment stage and then to generate subject-level, aggregate counts to be used as post-treatment outcomes. An example of a dependent variable that could be measured in this way is whether or not a subject shared a specific link or type of link with his or her friends. So far, use of such outcomes has been demonstrated in observational research (Guess et al. 2019a), although it is unclear whether this approach will be available to researchers in the future. 10.3.2 Surveys and Offline Behavior In addition to on-platform outcomes, it is often straightforward to link social media data to offline data sources such as the voter file (e.g., Barberá 2015; Broockman and Green 2014; Grinberg et al. 2019; Haenschen 2016). And, as mentioned above, when subjects are first recruited in a survey, subsequent waves can be administered 5 Matz et al. (2017) additionally examine “conversions,” or the proportion of those who see an ad who actually purchase or sign up for the product or service advertised. 6 See https://developers.facebook.com/docs/graphapi/reference/v3.3/user/feed. Researchers would first need to have an approved app that facilitates the sharing of user credentials.


post-treatment to collect outcome measures (see Chapter 4 in this volume). This is likely easier and more cost-effective than surveying categories of voters who have been targeted on-platform (Broockman and Green 2014), although that remains a viable option.
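As an illustration of the passive measurement strategy described in Section 10.3.1, the sketch below constructs a binary "tweeted about the campaign" outcome by matching scraped tweets against a campaign's unique link and then compares tweet rates across arms. All data, the link, and the column names are hypothetical placeholders; the original study's data and code are not reproduced here.

import pandas as pd

CAMPAIGN_LINK = "https://example.org/petition"  # placeholder for the campaign's unique URL

# Hypothetical experimental roster and a hypothetical scrape of tweets
subjects = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "condition": ["control", "promoted_tweet", "direct_message",
                  "both", "control", "promoted_tweet"],
})
tweets = pd.DataFrame({
    "user_id": [2, 2, 4, 6],
    "text": ["Sign this: " + CAMPAIGN_LINK, "unrelated post",
             "RT: " + CAMPAIGN_LINK, CAMPAIGN_LINK + " please sign"],
})

# Outcome: did the subject post at least one tweet containing the campaign link?
tweeted_ids = set(
    tweets.loc[tweets["text"].str.contains(CAMPAIGN_LINK, regex=False), "user_id"]
)
subjects["tweeted_campaign"] = subjects["user_id"].isin(tweeted_ids).astype(int)

# Intent-to-treat comparison: share tweeting about the campaign in each arm vs. control
rates = subjects.groupby("condition")["tweeted_campaign"].mean()
print(rates)
print("Difference vs. control:", (rates - rates["control"]).round(3).to_dict())

With a full sample, the same outcome column could be passed to a regression with covariate adjustment or to randomization inference rather than the simple differences in proportions shown here.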

10.4 Studying Network Effects on Social Media

Often, the reason for conducting a study on social media is to focus on the social element. Designing experiments using such data creates new opportunities for studying patterns of network diffusion. Social media data can be incredibly valuable in this area because it is otherwise rare to have randomized studies of peer effects with fully observable social network connections. But on Twitter, for instance, follow networks are public and can be easily scraped prior to an experimental intervention (e.g., Coppock et al. 2016). One approach, the peer encouragement design, holds particular promise. In this design, researchers randomly encourage subjects' social connections, alters, to take an action (Eckles et al. 2016). In the running example, the researchers embedded a peer encouragement experiment via a button asking petition signers to send a tweet about the campaign to their own Twitter followers. This nudge echoes the way in which many organizations strive for virality in online campaigns. Done in a controlled way, the researchers were able to find evidence of peer effects through the social network. Prominent voter-mobilization studies on Facebook also use a version of this design (Bond et al. 2012; Jones et al. 2017). Table 10.1 shows how this worked. Column (1) displays the first stage, in which being shown the tweet button increased the likelihood that subjects would tweet about the campaign to their own followers. To consistently estimate the effect of this on these followers' activity, the authors rely on the exclusion restriction that assignment of subjects to get the button does not have an effect on the subjects' followers' outcomes except via the tweets it causes the subjects to post.

Table 10.1 Effect of exposure to a friend's tweet.

                               Tweeted (1)         Signed (2)          Tweeted (3)
Shown button                   0.017∗∗∗ (0.002)
Exposure to tweet in network                       0.0003 (0.0003)     0.001∗∗∗ (0.0001)
Constant                       0.013∗∗∗ (0.002)    0.001∗∗∗ (0.0001)   0.0001 (0.0001)
n                              13,559              171,386             171,386
R2                             0.004               0.001               0.001

∗ p < 0.1; ∗∗ p < 0.05; ∗∗∗ p < 0.01.
Instrumental variables regression in which tweeting is the endogenous variable and assignment to the tweet button is the instrument. First stage is shown in Column (1). Columns (2) and (3) include all followers of followers in the network. Instrumental variables regressions use inverse probability weights proportional to the number of people they follow assigned to receive the button.

Since assignment to the button was random, however, this is a valid assumption. Using the tweet button encouragement as an instrument for tweeting the link, Column (2) shows no discernible effect on whether the subjects’ own followers followed the link and signed the online petition – a disappointing outcome for an advocacy organization. What the button does appear to have done, however, is increase (albeit by a minuscule amount) the likelihood that subjects’ followers tweet about or retweet the campaign link. Virality is difficult to achieve by design, especially for goals that require action outside of the platform environment itself. A subtle advantage of this type of peer encouragement design is that it can reduce the number of spillovers that are likely to occur, and thus makes more tractable any necessary modeling of spillover effects. In both this running example and in the Coppock et al. (2016) study, only a fraction of subjects click into the online petition, which doubles as a survey, after which the peer encouragement is administered. This ensures that the network of remaining subjects is sparse, even


though subjects were selected via a purposive method – sampling from the followers of a group’s Twitter account – that can produce dense interconnections between experimental participants.
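Before turning to ethics, it may help to see the estimator summarized in Table 10.1 in stylized form. The sketch below simulates follower-level data and walks through the two-stage logic of an instrumental variables regression, with random assignment of the tweet button as the instrument for exposure to a friend's tweet. It collapses the two-level (subject/follower) structure for simplicity, omits the inverse probability weights mentioned in the table notes, and reports naive second-stage standard errors that a packaged IV estimator would correct; it is an assumption-laden illustration, not the authors' analysis code.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100000

# Simulated follower-level data: the instrument is the (random) tweet-button
# assignment, "exposed" indicates that the friend tweeted the campaign link,
# and "signed" is the follower's petition outcome
shown_button = rng.integers(0, 2, n)
exposed = rng.binomial(1, 0.02 + 0.02 * shown_button)
signed = rng.binomial(1, 0.001 + 0.001 * exposed)
df = pd.DataFrame({"shown_button": shown_button, "exposed": exposed, "signed": signed})

# First stage: regress exposure on the randomized instrument
first = sm.OLS(df["exposed"], sm.add_constant(df["shown_button"])).fit()
df["exposed_hat"] = first.fittedvalues

# Second stage: regress the outcome on fitted exposure; the point estimate
# matches 2SLS, but these standard errors are not the corrected 2SLS errors
second = sm.OLS(df["signed"], sm.add_constant(df["exposed_hat"])).fit()

print(first.params)
print(second.params)

The second-stage coefficient on the fitted exposure variable is the 2SLS estimate of the peer effect; in a real analysis, the weights and corrected standard errors described in the table notes would be applied.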

10.5 Ethics and Experimentation on Social Media Despite their relatively short history, randomized field experiments involving social media platforms have already generated more than their share of controversy. There is evidence of an aversion among the public to testing the effectiveness of policies via controlled experimentation, dubbed the “A/B effect” (Meyer et al. 2019): deploying a policy or intervention to an entire population without testing it first is seen as preferable to randomly assigning two alternatives to assess them rigorously. In the minutely personalized realm of social media, this kind of testing may additionally bring to mind the staggering amounts of information platforms have amassed on their users by logging their behaviors, preferences, and interactions with each other. When combined with a lack of public understanding about how these data are used and with whom they are shared, perhaps it is not surprising that experiments on social media users have raised comparisons to historical abuses of power by scientists and researchers (for a general discussion of experimental ethics, see Chapter 7 in this volume). That said, not all large-scale social media experiments have generated public backlash. On the one hand, a well-known study of emotional contagion on Facebook led to a massive outcry and a debate about informed consent on social media (Kramer et al. 2014; Verma 2014). In that study, a research team consisting of both Facebook scientists and outside academics randomly manipulated the probability with which Facebook posts with a negative or positive emotional valence (as measured by a standard dictionary-based scoring method) would appear on a user’s News Feed. The researchers were interested in whether these shifts would be reflected in the emotional content of users’ own posts.

Yet, two years earlier, a highly publicized randomized experiment conducted by Facebook that caused hundreds of thousands of votes to be cast throughout the USA in the 2010 midterm elections generated mainly positive coverage in the media (Bond et al. 2012). While necessarily anecdotal, these divergent reactions suggest that the specific type of intervention, rather than the sheer scale, may be a factor that determines whether studies are framed as innovative contributions to knowledge or echoes of a dark past in scientific experimentation. Partially for these kinds of reasons, much experimental research on social media effects takes place in artificial settings using mock stimuli.7 These “off-platform” studies often feature quite convincing simulated environments shown to laboratory subjects or survey respondents. Outcomes take the form of self-reported measures, such as perceived news headline accuracy and hypothetical intent to share a social media post (e.g., Anspach et al. 2019; Bode 2016; Bode and Vraga 2018; Pennycook et al. 2018). The external validity of these designs depends largely on the extent to which self-reported behavioral intent measures correspond to how subjects would actually respond to comparable stimuli on-platform. There have been attempts to validate such measures using surveys linked to digital trace data (Guess et al. 2019b; Haenschen 2019), but these necessarily deal with observational data. It remains an unfortunate fact that there has been no systematic demonstration of the correspondence between similarly designed off-platform and on-platform experiments in the social media realm. Aside from wellknown biases that can arise from selfreporting behaviors, a particular area of concern for the validity of social media experiments conducted in simulated environments is the absence of social observability. 7 This is true for academic research and, increasingly, even for research done by the platforms themselves. An alternative approach is to think of a social media platform as a bundled treatment whose overall use can be encouraged or discouraged. In their Facebook deprivation study, Allcott et al. (2020) paid subjects to deactivate their accounts for a month.


Treatments whose hypothesized mechanisms operate via social ties – or perceptions thereof – can certainly be attempted off-platform using staged interactions or mock accounts, but these are unlikely to be perceived by subjects as credible.8 Until attempts are made to validate currently feasible research approaches with on-platform benchmarks – likely only possible with cooperation from the platforms themselves – it will be difficult to assess the generalizability of much current academic research on social media. Given that robust collaborations on substantive social science questions (as with the emotional contagion and voter mobilization studies) seem unlikely in the foreseeable future, perhaps a more attainable goal is to work toward a set of validated best practices that could enable independent researchers to continue generating useful knowledge while minimizing the need to navigate uncharted ethical waters. Fortunately, excellent models for such an effort already exist: Jerit et al. (2013) ran simultaneous survey and field experiments on media effects, and Berinsky et al. (2012) replicated canonical experimental findings to demonstrate the usefulness of the Mechanical Turk platform for subject recruitment. These studies could provide a starting point for researchers seeking to establish the conditions under which lab-like designs for social media are externally valid.

8 Prior to the Cambridge Analytica scandal (see https://en.wikipedia.org/wiki/Facebook-Cambridge_Analytica_data_scandal) and other privacy-related changes, Facebook allowed some access to users’ friend graph data through its API for authorized research purposes. Some studies using these data were published without controversy. For example, Turcotte et al. (2015) provide a glimpse into how it was possible to conduct survey-based experiments in which mock social media posts could be “shared” by subjects’ actual Facebook friends, thus offering a dose of realism.
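One building block of the validation agenda described above can be expressed compactly. The sketch below uses hypothetical data (the field names and values are assumptions) to show the basic correspondence check: comparing survey respondents’ self-reported sharing intent with sharing behavior recorded in linked digital trace data, in the spirit of Guess et al. (2019b). A full analysis would also model measurement error and selection into linkage rather than report a raw correlation.

```python
# Hypothetical linked records: one survey respondent per row, joined (with consent)
# to their logged platform behavior. Field names and values are illustrative assumptions.
from statistics import correlation  # requires Python 3.10+

linked = [
    {"intent_to_share": 5, "logged_shares": 3},
    {"intent_to_share": 1, "logged_shares": 0},
    {"intent_to_share": 4, "logged_shares": 1},
    {"intent_to_share": 2, "logged_shares": 0},
    {"intent_to_share": 5, "logged_shares": 4},
    {"intent_to_share": 3, "logged_shares": 1},
]

intent = [r["intent_to_share"] for r in linked]
logged = [r["logged_shares"] for r in linked]

# A simple correspondence check between stated intent and observed behavior.
print(f"Pearson r between stated intent and logged shares: {correlation(intent, logged):.2f}")
```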

10.6 Discussion

This chapter has laid out a variety of possible avenues for using social media data at each stage of an experimental research design – sampling, designing treatments, and collecting outcomes – as well as possibilities for taking advantage of the network ties enabled by these platforms. If the designs researchers have devised so far are any indication, research using social media will continue to innovate by using platform features in new and creative ways. Given the increasing emphasis that both political campaigns and other political actors are placing on social media, its importance for political science will continue to increase, whether or not the questions of interest specifically concern social media.

A particularly fruitful way forward will be to combine different design and measurement elements in novel ways. This could be especially true in research that spans the boundary between online and offline. Imagine an on-the-ground persuasion experiment conducted with an organization that, instead of standard outcomes, uses geotargeted social media activity as a passive measure of sentiment or intensity. People expressing themselves on social media may be of interest in this scenario (e.g., King et al. 2017), but, from another perspective, such activity could be viewed as a proxy for discussions (of any type) generated throughout a community. Or social media posting frequency could be used as a measure of political interest, which could then inform design decisions for a more general study of engagement. (A brief illustrative sketch of this passive-outcome idea appears at the end of this section.)

Conducting experiments with social media data has its own set of challenges, which the scholarly community continues to address. As these debates make clear, a defining feature of research incorporating online data is that external validity concerns are especially acute in the temporal dimension (Munger 2019). An experiment designed on 2014-vintage Facebook may no longer be possible today; moreover, people adapt their expectations and behavior over time, which suggests that even the same design could yield contradictory results at two points in time. These issues illustrate the pitfalls of designing and executing research on a single platform or set of platforms owned by large corporations whose rules can change without warning and whose business models are not necessarily aligned with scientific imperatives. But as more than two-thirds of US adults use Facebook as a primary means of communication, political or otherwise, research on political behavior may eventually become synonymous with research on online political behavior. Regardless, this research should not be left solely to the platforms themselves.
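As a purely illustrative sketch of the passive-outcome idea mentioned above, the snippet below aggregates hypothetical geotagged posts into an issue-related posting count per area and compares treated and control areas. The keyword filter, area identifiers, and assignment are assumptions; real trace data would require platform access and a more careful measurement model.

```python
# Hypothetical geotagged posts and a hypothetical area-level random assignment.
from collections import defaultdict
from statistics import mean

posts = [
    ("area_1", "the new transit plan is great"),
    ("area_1", "transit meeting tonight, come along"),
    ("area_2", "nice weather today"),
    ("area_3", "big debate about the transit plan downtown"),
    ("area_4", "any lunch spot recommendations?"),
]
assignment = {"area_1": "treated", "area_2": "control", "area_3": "treated", "area_4": "control"}

# Count issue-related posts per area (a crude keyword filter stands in for a classifier).
counts = defaultdict(int)
for area, text in posts:
    counts[area] += "transit" in text

# Compare average issue-related posting across arms as a passive outcome measure.
by_arm = defaultdict(list)
for area, arm in assignment.items():
    by_arm[arm].append(counts[area])

print("treated - control:", mean(by_arm["treated"]) - mean(by_arm["control"]))
```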

References

Allcott, Hunt, Luca Braghieri, Sarah Eichmeyer, and Matthew Gentzkow. 2020. “The Welfare Effects of Social Media.” American Economic Review 110(3): 629–676.
Anspach, Nicolas M., Jay T. Jennings, and Kevin Arceneaux. 2019. “A Little Bit of Knowledge: Facebook’s News Feed and Self-Perceptions of Knowledge.” Research & Politics. DOI: 10.1177/2053168018816189.
Bail, Christopher A., Lisa P. Argyle, Taylor W. Brown, John P. Bumpus, Haohan Chen, M. B. Fallin Hunzaker, Jaemin Lee, Marcus Mann, Friedolin Merhout, and Alexander Volfovsky. 2018. “Exposure to Opposing Views on Social Media Can Increase Political Polarization.” Proceedings of the National Academy of Sciences of the United States of America 115(37): 9216–9221.
Bakshy, Eytan, Solomon Messing, and Lada A. Adamic. 2015. “Exposure to Ideologically Diverse News and Opinion on Facebook.” Science 348(6239): 1130–1132.
Barberá, Pablo. 2015. “Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data.” Political Analysis 23(1): 76–91.
Barberá, Pablo. 2016. “Less Is More? How Demographic Sample Weights Can Improve Public Opinion Estimates Based on Twitter Data.” Technical report.
Barberá, Pablo, Andreu Casas, Jonathan Nagler, Patrick Egan, Richard Bonneau, John T. Jost, and Joshua A. Tucker. 2019. “Who Leads? Who Follows? Measuring Issue Attention and Agenda Setting by Legislators and the Mass Public Using Social Media Data.” American Political Science Review 113(4): 883–901.
Berinsky, Adam, Gregory Huber, and Gabriel Lenz. 2012. “Evaluating Online Labor Markets for Experimental Research: Amazon.com’s Mechanical Turk.” Political Analysis 20(3): 351–368.
Bode, Leticia. 2016. “Political News in the News Feed: Learning Politics from Social Media.” Mass Communication and Society 19(1): 24–48.
Bode, Leticia, and Emily K. Vraga. 2018. “See Something, Say Something: Correction of Global Health Misinformation on Social Media.” Health Communication 33(9): 1131–1140.
Bond, Robert M., Christopher J. Fariss, Jason J. Jones, Adam D. Kramer, Cameron Marlow, Jaime E. Settle, and James H. Fowler. 2012. “A 61-Million-Person Experiment in Social Influence and Political Mobilization.” Nature 489(7415): 295–298.
Broockman, David, and Donald Green. 2014. “Do Online Advertisements Increase Political Candidates’ Name Recognition or Favorability? Evidence from Randomized Field Experiments.” Political Behavior 36(2): 263–289.
Collins, Kevin, Laura Keane, and Josh Kalla. 2014. “Youth Voter Mobilization through Online Advertising: Evidence from Two GOTV Field Experiments.” Paper presented at the Annual Meeting of the American Political Science Association, Washington, DC.
Coppock, Alexander, Andrew Guess, and John Ternovski. 2016. “When Treatments Are Tweets: A Network Mobilization Experiment over Twitter.” Political Behavior 38(1): 105–128.
Druckman, James N., and Donald P. Green. 2013. “Mobilizing Group Membership: The Impact of Personalization and Social Pressure E-mails.” Sage Open. DOI: 10.1177/2158244013492781.
Eckles, Dean, René F. Kizilcec, and Eytan Bakshy. 2016. “Estimating Peer Effects in Networks with Peer Encouragement Designs.” Proceedings of the National Academy of Sciences of the United States of America 113(27): 7316–7322.
Feezell, Jessica T. 2018. “Agenda Setting through Social Media: The Importance of Incidental News Exposure and Social Filtering in the Digital Era.” Political Research Quarterly 71(2): 482–494.
Foos, Florian, Lyubomir Kostadinov, Nikolay Marinov, and Frank Schimmelfennig. 2018. “Does Social Media Promote Civic Activism? A Field Experiment with a Civic Campaign.” Working paper. URL: www.florianfoos.net/resources/Foos_et_al_SocialMedia.pdf.
Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: W. W. Norton.
Gerber, Alan S., James G. Gimpel, Donald P. Green, and Daron R. Shaw. 2011. “How Large and Long-Lasting Are the Persuasive Effects of Televised Campaign Ads? Results from a Randomized Field Experiment.” American Political Science Review 105(1): 135–150.
Grinberg, Nir, Kenneth Joseph, Lisa Friedland, Briony Swire-Thompson, and David Lazer. 2019. “Fake News on Twitter during the 2016 U.S. Presidential Election.” Science 363(6425): 374–378.
Guess, Andrew M., Alexander Coppock, and Kevin Collins. n.d. “Petitioning the Court: Testing Promoted Tweets and DMs in a Networked Field Experiment.” Unpublished research report.
Guess, Andrew, Jonathan Nagler, and Joshua A. Tucker. 2019a. “Less than You Think: Prevalence and Predictors of Fake News Dissemination on Facebook.” Science Advances 5(1): eaau4586.
Guess, Andrew, Kevin Munger, Jonathan Nagler, and Joshua Tucker. 2019b. “How Accurate Are Survey Responses on Social Media and Politics?” Political Communication 36(2): 241–258.
Haenschen, Katherine. 2016. “Social Pressure on Social Media: Using Facebook Status Updates to Increase Voter Turnout.” Journal of Communication 66(4): 542–563.
Haenschen, Katherine. 2019. “Self-Reported versus Digitally Recorded: Measuring Political Activity on Facebook.” Social Science Computer Review. DOI: 10.1177/0894439318813586.
Hager, Anselm. 2019. “Do Online Ads Influence Vote Choice?” Political Communication 36(3): 376–393.
Jerit, Jennifer, Jason Barabas, and Scott Clifford. 2013. “Comparing Contemporaneous Laboratory and Field Experiments on Media Effects.” Public Opinion Quarterly 77(1): 256–282.
Jones, Jason J., Robert M. Bond, Eytan Bakshy, Dean Eckles, and James H. Fowler. 2017. “Social Influence and Political Mobilization: Further Evidence from a Randomized Experiment in the 2012 U.S. Presidential Election.” PLoS ONE 12(4): e0173851.
Karpf, David. 2013. “The Internet and American Political Campaigns.” The Forum 11(3): 413–428.
King, Gary, Benjamin Schneer, and Ariel White. 2017. “How the News Media Activate Public Expression and Influence National Agendas.” Science 358(6364): 776–780.
Klar, Samara, and Thomas J. Leeper. 2019. “Identities and Intersectionality: A Case for Purposive Sampling in Survey-Experimental Research.” In Experimental Methods in Survey Research: Techniques That Combine Random Sampling with Random Assignment, eds. Paul Lavrakas, Michael Traugott, Courtney Kennedy, Allyson Holbrook, Edith de Leeuw, and Brady West. Hoboken, NJ: John Wiley & Sons, pp. 419–433.
Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” Proceedings of the National Academy of Sciences of the United States of America 111(24): 8788–8790.
Matz, Sandra C., Michal Kosinski, Gideon Nave, and David J. Stillwell. 2017. “Psychological Targeting as an Effective Approach to Digital Mass Persuasion.” Proceedings of the National Academy of Sciences of the United States of America 114(48): 12714–12719.
Meyer, Michelle N., Patrick R. Heck, Geoffrey S. Holtzman, Stephen M. Anderson, William Cai, Duncan J. Watts, and Christopher F. Chabris. 2019. “Objecting to Experiments That Compare Two Unobjectionable Policies or Treatments.” Proceedings of the National Academy of Sciences of the United States of America 116(22): 10723–10728.
Munger, Kevin. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–649.
Munger, Kevin. 2019. “The Limited Value of Non-Replicable Field Experiments in Contexts with Low Temporal Validity.” Social Media + Society. DOI: 10.1177/2056305119859294.
Munger, Kevin, Jonathan Nagler, Joshua Tucker, and Mario Luca. n.d. “Everyone on Mechanical Turk Is Above a Threshold of Digital Literacy: Sampling Strategies for Studying Digital Media Effects.” URL: http://kmunger.github.io/pdfs/clickbait_mturk.pdf.
Mutz, D. C., and L. Young. 2011. “Communication and Public Opinion.” Public Opinion Quarterly 75(5): 1018–1044.
Pennycook, Gordon, Tyrone D. Cannon, and David G. Rand. 2018. “Prior Exposure Increases Perceived Accuracy of Fake News.” Journal of Experimental Psychology: General 147(12): 1865–1880.
Pew. 2019a. “National Politics on Twitter: Small Share of U.S. Adults Produce Majority of Tweets.” URL: www.people-press.org/2019/10/23/national-politics-on-twitter-small-share-of-u-s-adults-produce-majority-of-tweets/.
Pew. 2019b. “Share of U.S. Adults Using Social Media, Including Facebook, Is Mostly Unchanged since 2018.” URL: www.pewresearch.org/fact-tank/2019/04/10/share-of-u-s-adults-using-social-media-including-facebook-is-mostly-unchanged-since-2018/.
Ryan, Timothy J. 2012. “What Makes Us Click? Demonstrating Incentives for Angry Discourse with Digital-Age Field Experiments.” Journal of Politics 74(4): 1138–1152.
Ryan, Timothy J., and David E. Broockman. 2012. “Facebook: A New Frontier for Field Experiments.” Newsletter of the APSA Experimental Section 3(2): 2–10.
Ryan, Timothy J., and Ted Brader. 2017. “Gaffe Appeal: A Field Experiment on Partisan Selective Exposure to Election Messages.” Political Science Research and Methods 5(4): 667–687.
Settle, Jaime E. 2018. Frenemies: How Social Media Polarizes America. Cambridge, UK: Cambridge University Press.
Siegel, Alexandra, and Vivienne Badaan. 2018. “#No2Sectarianism: Experimental Approaches to Reducing Sectarian Hate Speech Online.” Working paper. URL: https://alexandra-siegel.com/wp-content/uploads/2018/11/Siegel_Badaan_Nov2018.pdf.
Steinert-Threlkeld, Zachary C. 2017. Twitter as Data. Cambridge, UK: Cambridge University Press.
Sunstein, Cass R. 2017. #Republic. Princeton, NJ: Princeton University Press.
Teresi, Holly, and Melissa R. Michelson. 2015. “Wired to Mobilize: The Effect of Social Networking Messages on Voter Turnout.” Social Science Journal 52(2): 195–204.
Turcotte, Jason, Chance York, Jacob Irving, Rosanne M. Scholl, and Raymond J. Pingree. 2015. “News Recommendations from Social Media Opinion Leaders: Effects on Media Trust and Information Seeking.” Journal of Computer-Mediated Communication 20(5): 520–535.
Vaccari, Cristian, Augusto Valeriani, Pablo Barberá, Rich Bonneau, John T. Jost, Jonathan Nagler, and Joshua A. Tucker. 2015. “Political Expression and Action on Social Media: Exploring the Relationship Between Lower- and Higher-Threshold Political Activities among Twitter Users in Italy.” Journal of Computer-Mediated Communication 20(2): 221–239.
Vaccari, Cristian, Augusto Valeriani, Pablo Barberá, Richard Bonneau, John T. Jost, Jonathan Nagler, and Joshua Tucker. 2013. “Social Media and Political Communication: A Survey of Twitter Users During the 2013 Italian General Election.” Rivista italiana di scienza politica 43(3): 381–410.
Verma, Inder M. 2014. “Editorial Expression of Concern: Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks.” Proceedings of the National Academy of Sciences of the United States of America 111(29): 10779.
Wang, Lucy X., Arthi Ramachandran, and Augustin Chaintreau. 2016. “Measuring Click and Share Dynamics on Social Media: A Reproducible and Validated Approach.” In Tenth International AAAI Conference on Web and Social Media. Palo Alto, CA: Association for the Advancement of Artificial Intelligence, pp. 108–113.
Zhang, Baobao, Matto Mildenberger, Peter D. Howe, Jennifer Marlon, Seth A. Rosenthal, and Anthony Leiserowitz. 2020. “Quota Sampling Using Facebook Advertisements.” Political Science Research and Methods 8(3): 558–564.

CHAPTER 11

How to Form Organizational Partnerships to Run Experiments∗

Adam Seth Levine

Abstract

There is growing interest in bridging the gap between science and society. Fostering collaborations between academics and practitioners, such as partnering to conduct experiments, is an increasingly popular way to do that. Yet, despite the growing number of such partnerships, academics who are new to them often lack guidance about the considerations to keep in mind and the steps involved. This chapter fills that gap. I discuss the benefits, challenges, and goals of organizational partnerships, and provide a step-by-step guide for academics beginning new ones. Throughout, I emphasize the fact that such partnerships entail building new working relationships with people who have diverse forms of knowledge. As a result, both a learning mindset and a relational mindset are necessary.

* I owe many thanks to David Broockman, Colin Cepuran, James N. Druckman, Donald P. Green, Varja Lipovsek, Mary McGrath, Adrienne Scott, Ari Shaw, Chagai Weiss, and Alisa Zomer for their extremely helpful, thorough, and thought-provoking feedback.

11.1 Introduction

Many have called for bridging the gap between science and society in order to better understand, explain, and mitigate pressing social problems (Druckman and Lupia 2015; National Research Council 2012; Watts 2017). Central to this goal is fostering collaborations between academics and practitioners (Nutley et al. 2007). An active research community across many disciplines has identified ways to do so, such as research–practice partnerships (Coburn and Penuel 2016), knowledge brokering (Dobbins et al. 2009), partnerships with aid organizations (Karlan and Appel 2016), and university extension programs (Chambliss and Lewenstein 2012). Within political science, there has also been a long-standing desire to close this gap (George 1993). Experiments involving
organizational partnerships are an increasingly common way to do that. One recent study found that 62% of articles with field experiments published between 2000 and 2017 in American Political Science Review, American Journal of Political Science, and Journal of Politics entailed a partnership (Butler 2019). Yet academics who are new to these partnerships lack formal guidance about what they entail, especially as process-related details rarely (if ever) appear in published work. Thus, learning has typically occurred via informal conversations and personal trial and error, both of which advantage some researchers and disadvantage others. The goal of this chapter is to fill this gap by providing a systematic overview of the process along with a detailed discussion of the opportunities and challenges. Although each partnership entails its own particular nuances, possibilities, and constraints, my aim is to provide an overview that applies broadly. Throughout the chapter, I advocate a particular approach. When academics begin a new experiment, they naturally have what I would refer to as a learning mindset. They are focused on what they want to learn, how doing so will advance our collective understanding of the world and potentially lead to a new publication, and what an optimal and feasible research design might entail. Having a learning mindset is certainly important, yet I argue that academics also need to adopt a relational mindset when these experiments entail partnering with an organization. These partnerships entail building relationships with people who have diverse forms of knowledge. They are a form of civic engagement in which diverse individuals work together to better understand and ameliorate the problems facing their communities and society at large (Allen 2016). They produce private benefits for the participants, such as publications, funding, and so on. They also help to establish norms of collaboration between the research and practice communities, which is a public benefit. My goal throughout the chapter is also to highlight illustrative examples. Yet

achieving this goal is more difficult than it may initially appear. Details about the inception of partnerships are typically unpublished, and in conversations with others I have often found that people struggle to recall exactly what happened. Thus, I will mostly be drawing from my own experiences forming partnerships. In addition, at various moments I will refer to examples of partnerships that I have helped create as president of research4impact (r4impact.org), a nonprofit organization that connects researchers and practitioners for many reasons, including to collaborate on experiments. Because of my matchmaking role within this organization, I have a unique window into relevant background information for various partnerships.

11.2 Definitions and Scope

For purposes of this chapter, I define organizational partnerships as academics working with practitioners (people in the nonprofit, government, and/or for-profit sector) in order to conduct experiments together. They are a type of formal collaboration that entails shared decision-making authority and a willingness to be held accountable to each other. During these partnerships, all parties are at least somewhat involved with the various stages of conducting an experiment: conceptualization, design, fieldwork, analysis, and dissemination of the findings. I acknowledge that academics and practitioners may partner to conduct nonexperimental research as well, but for the purposes of this chapter, I focus on situations in which new experimental data are the desired result. Note that this definition would not include consulting arrangements with a fee-for-service model. With organizational partnerships, typically no money changes hands – instead, the main “payment” for academics is a data set they can use in publications. The primary audience for this chapter is academics who are just beginning to have (or are thinking about having) conversations with practitioners about partnering on an
experiment. I focus on the considerations and steps that are vital regardless of the partner, though it is worth noting that partnerships with government agencies often involve extra steps (e.g., contracting regulations), and partnerships that occur as part of a preexisting fellowship (e.g., the Office of Evaluation Sciences fellowship) may involve fewer. Due to space limitations, I will not discuss important normative questions about the purpose of social science and whether academics should be partnering with particular organizations on particular projects. Instead, I proceed under the assumption that readers are interested in “solving practical problems that outsiders would recognize” (Watts 2017, p. 3), and that they view a partnership as a good way to do that.

11.3 Why Pursue an Organizational Partnership?

In the next two sections, I discuss benefits and challenges. I start here with the benefits: Why pursue an organizational partnership? What goals do partners want to achieve? The most fundamental answer is the same reason why academics choose to partner with each other on a research project: they are intensely curious and share underlying goals. For example, many academics and practitioners ultimately want to eliminate corruption, make government work better, reduce poverty, increase voter engagement, improve health, confront climate change, eliminate prejudice, reduce electoral fraud, and so on. That said, even if they share the same underlying goals, they may have distinct professional reasons for partnering to conduct an experiment. Scholars (in their work) approach these topics by thinking about how studies can inform underlying theoretical questions and speak to mechanisms that are generalizable. Practitioners (in their work) approach these topics from the perspective of wanting to know what works and how new knowledge can directly inform their organizational policies and programs. So, for
instance, scholars interested in how to reduce corruption are often motivated by theoretical questions about institutional design and are mindful of how one individual intervention will add knowledge to a broader body of literature. Practitioners working to reduce corruption want to know, first and foremost, whether a given intervention works and whether it could feasibly be implemented on a broader scale.

11.3.1 What Benefits Do Academics Gain from Partnering with Practitioners?

The main benefit is the opportunity to answer a question that simultaneously has theoretical and practical significance. Academics who pursue partnerships care about the world and want their work to have impact. Partnering with an organization to design and implement a study can greatly increase the likelihood that the results will impact organizational practice, public policy, and attitudes of those outside academia (cf. Coburn and Penuel 2016). Working with a practitioner provides unique and powerful insights into what questions are most relevant from the perspective of real-world decision-makers. Organizational partnerships also offer the opportunity to collect behavioral and administrative data that would otherwise not be feasible to gather. For example, if scholars want to understand how well a mentoring intervention operates among rural workers in Kenya, it is likely more feasible (if not necessary) to partner with an organization that has the credibility to administer that intervention in local communities, along with the capacity, knowledge, and governmental authority to do so. Moreover, on-the-ground knowledge from organizational partners provides insights into subtle contextual features, like who should and who should not be part of the study population.

11.3.2 What Benefits Do Practitioners Gain from Partnering with Academics?

As noted above, practitioners are interested in partnerships when the results will speak
directly to what works and/or does not work (i.e., if the findings are directly relevant to how they can effectively achieve their goals). Practitioners may also appreciate the broader outlook of academics, who often have more time and incentive to situate specific observations in a broader body of knowledge. They may also partner in order to satisfy funding demands, as funders sometimes want an outside party to evaluate claims of effectiveness. Lastly, in addition to these instrumental motivations, practitioners are often intrinsically motivated, just like scholars. The idea of producing new discoveries and knowledge can be exciting and fun!

11.3.3 What Goals Do Organizational Partnerships Often Pursue?

The experiments that partnerships pursue typically have one of two main goals: to assess the impact of existing activities or to test a new idea that hasn’t been tried before. The first goal is to assess the impact of existing activities: How well are the things they are already doing working? In many cases, practitioners pursue this goal because they want to design a randomized controlled trial where one did not exist beforehand. For instance, one of the organizations that reached out to research4impact in 2018 to find a potential collaborator was looking to increase voter turnout. Based in the UK, this organization had already launched a new website that included user-friendly information about candidate positions, polling locations, and the like. It had been regularly tracking who visited the site, but “clicks on a website” are not the same thing as actually boosting voter turnout. The leadership wanted to think about how to conduct a randomized controlled trial within their website that would test whether the information they provided had a causal impact on voter turnout. In that example – and this is true more generally for “impact assessment” experiments – the organization would be mostly responsible for supplying the research question (i.e., “Does this program work? What impact does this program have?”). The academic would supply technical expertise about how to design the assessment, along with substantive knowledge of relevant literature and previous findings. Costs may be shared in a variety of ways, depending upon the amount involved and also the degree of post-intervention measurement.

The second broad type of partnership goal is to test a new idea that hasn’t been tried before. On the organizational side, these partnerships may be valuable because practitioners want to explore entirely new ideas for furthering their mission and addressing problems. For example, one organization that reached out to research4impact was under contract with a government agency and tasked with testing new ways to design forms for social benefits that would reduce churn (i.e., people who lose aid for administrative reasons such as not completing paperwork and then have aid restored in the near future). Although the broad goal was predetermined, the practitioners leading the project were entirely open to new suggestions of what the forms could look like and also how to conduct the test. In this case, the organization was supplying on-the-ground expertise, as well as knowledge of the history of why the forms looked the way they did and why churn was a problem as a result. The academic partner was supplying both theoretical and technical expertise.
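To illustrate the kind of technical input the academic partner might contribute in the website example above, the sketch below shows one common way to randomize site visitors and, later, estimate the effect on validated turnout. It is a hypothetical outline under assumed names and made-up data, not the partner organization’s actual system.

```python
# Hypothetical outline: hash-based assignment of website visitors to an "information"
# or "control" experience, later merged with validated turnout records.
import hashlib

def assign_visitor(visitor_id: str, experiment: str = "turnout-test") -> str:
    """Stable, balanced-in-expectation assignment keyed to a visitor identifier."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return "information" if int(digest, 16) % 2 == 0 else "control"

def turnout_difference(records) -> float:
    """Difference in turnout rates between arms; records = [(arm, voted), ...]."""
    arms = {"information": [], "control": []}
    for arm, voted in records:
        arms[arm].append(voted)
    info, ctrl = arms["information"], arms["control"]
    return sum(info) / len(info) - sum(ctrl) / len(ctrl)

# Made-up merged data standing in for assignment logs joined to a voter file.
records = [("information", 1), ("information", 0), ("information", 1),
           ("control", 0), ("control", 1), ("control", 0)]
print(assign_visitor("visitor_42"), f"effect estimate: {turnout_difference(records):.2f}")
```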

11.4 What Can Be Challenging about Pursuing Organizational Partnerships?

Organizational partnerships entail many benefits, yet realizing them typically involves overcoming some challenges as well. I discuss the main challenges in this section. Being aware of them in advance will help provide a firm foundation for success.

11.4.1 Ensuring a Benefit Exchange

Although alluded to in the previous section, I should mention this point here as well: a key challenge is ensuring that academics and practitioners both see clear benefits that align with their professional motivations and
incentives. For academics, this means making sure that the study answers research questions that speak to generalizable mechanisms and that contribute to a larger scientific discipline. When they assess the value of doing an experiment, they want to ensure high internal and external validity, as well as the ability to publish data regardless of the findings. Most practitioners are happy to contribute to a body of scientific knowledge, but at the same time their top priority is typically to know what works. They are most likely to partner if doing so will produce a concrete product that will help them directly achieve their goals more effectively. Indeed, as Druckman (2000, p. 1568) notes, for practitioners “explanation is more of a curiosity than a quest: Answers to the evaluative question (Does it work?) take priority over answers to the explanatory question (Why does it work?).” In the realm of experiments, this benefit exchange has a particular manifestation because academics and practitioners start with different orientations. Academics who are interested in conducting experiments want to directly observe behavioral or attitudinal change and be able to attribute it to a well-defined treatment, along with being able to calculate the size of the effect. To them, it seems natural to design an experiment that can isolate the impact of one or a small handful of potential manipulations on an outcome of interest. Yet the starting point for practitioners, especially if they do not have any experience with conducting experiments, is often to think holistically in terms of the wide variety of factors that might explain a particular outcome. This distinction underscores a point I will return to: when setting up new partnerships, academics should be prepared to be ambassadors for good research design (even if this means limiting the scope of what is studied). A final point about the benefit exchange is that academics and practitioners may have different attitudes toward risk when conducting research. Many practitioners are risk-averse, especially about studying programs and/or policies that they are very committed to, have funding for, and have

jobs that depend on. Yet academics may feel as though they have career incentives to make their name by criticizing something that has been done in the past. Aligning the benefit exchange when designing a study means being aware of these possibly conflicting motivations and ensuring that all parties are genuinely interested in the results (whatever they happen to be).

11.4.2 Establishing a Coalition of Support

When it comes to research, one of the main benefits for academics is being able to set their own agenda, decide how to allocate their time, and choose which projects to work on. Yet, for practitioners, research projects may not be part of their job description. They have other responsibilities and often are embedded in a larger organizational structure. A new research collaboration may require that academics help build a coalition of support among several decision-makers within a partner organization. Adding to this complexity is a concern about staff turnover, which is especially threatening for experiments with long-lasting treatments, long-term follow-ups, and/or replications. Thus, while this challenge of establishing a coalition of support may often seem like a burden, I encourage academics to view it as helpful for insulating the experiment from organizational changes.

11.4.3 Overcoming Language Differences

Experiments entail a certain vernacular – internal validity, external validity, treatment, random assignment, spillover, blocking – that is highly familiar to academics but may sound daunting to others. Academics should be prepared in advance to clarify what these words mean and why they matter not just for designing a sound experiment, but also for practitioners’ goals (i.e., we would not be able to learn something about what works unless we ensure that the experiment has high internal validity, etc.).

11.4.4 Aligning Timelines

Another key challenge is that academics and practitioners often work on different
timelines. Designing and carrying out research projects takes time. Academics often do not face immediate deadlines and want to take the time necessary for rigor, such as conducting a pilot study prior to the full experiment. Yet practitioners’ work may be focused on responding to changing circumstances in the world, or at the very least may be far more closely tied to world events like elections, major policy announcements, and national emergencies. Research, quite simply, is often not the highest priority. That said, the large majority of experiments I have conducted with organizational partners proceeded quickly and smoothly. In fact, my first five (with three different nonprofits – two local and one national) moved from initial conversations to data collection within four months. However, academics should be prepared for the possibility that unexpected and uncontrollable events may cause delays. For example, starting in June 2016, I began a partnership with a national organization. During that summer and into the beginning of the fall, we pilot tested various treatments for an experiment that was initially planned to start in winter 2017. However, after Trump’s election in November, my organizational partner had to indefinitely pause our project in order to devote staff resources to newly emerged funding threats. The partnership did not resume until late April 2017, and ultimately data collection did not begin until May 2018. Fortunately, however, there was enough support for the study at all levels of the organization that we were able to move forward even after a lengthy delay. The upshot of these examples is that collaborations may not be ideal if academics face a strict, impending deadline. If at all possible, academics should build plenty of buffer time into their timelines.

11.6.5 Establishing a New Working Relationship

Academics and practitioners are generally part of very different social networks. This means that, even if they have a friend or colleague in common, they are unlikely
to personally know each other in advance. Establishing a new working relationship between strangers can be nontrivial. Academics want to ensure that practitioners are committed to the project and all of the specific procedures involved with conducting a sound experiment. Practitioners are often mission-driven and want to ensure academics are committed to their goals, value expertise other than their own, and will be pleasant to interact with (Levine 2020). These latter two points reflect the fact that, as a whole, academics are viewed as highly competent but not always as very friendly or warm (Fiske and Dupree 2014). In sum, I have identified five challenges that new partnerships need to tackle in order to be successful: ensuring a benefit exchange, establishing a coalition of support, overcoming language differences, aligning timelines, and establishing a new working relationship. I will discuss how to do so in a step-by-step guide later in the chapter.

11.5 Ethical Considerations

Chapter 7 in this volume discusses ethical considerations for experiments in general. In this section, I briefly discuss ethical considerations as they relate specifically to experiments with organizational partners. One set of ethical considerations that academics should keep in mind relates to the partnership itself. First, academics should minimize harm. They must be mindful that organizations are very concerned about how they are represented in print, due to funding concerns and also, in some cases, physical safety concerns. These considerations affect all aspects of the project, including the conceptualization, design, implementation, and (especially) dissemination of results. Another aspect of minimizing harm is perhaps less obvious. Partnerships often involve a large investment of scarce organizational resources, and academics need to make sure that the study is really worth it. They should not strive to simply causally identify something because it
is possible, but instead ensure that doing so answers a question that is truly important to answer. Another ethical concern relates to transparency, as the value of scientific results lies in their transparent procedures and ability to be replicated. Take care to ensure that all implementation procedures are clearly documented and followed. Doing so helps avoid situations in which feelings of obligation toward the organization and/or one’s desire for future collaboration get in the way of transparent and honest academic practices. In addition, there are two data-related ethical concerns that should be agreed upon in advance (and, as noted later, codified in writing prior to data collection). One is about data ownership. Oftentimes partners agree that data collected are jointly owned by both of them. That said, academics will need to make sure they have the right to review and publish study details, data, and findings. The second is about the plan for dissemination – how the data and findings will be shared, including the kinds of write-ups that will be produced (in addition to academic papers, perhaps policy briefs, or presentations for funders, etc.). Key parts of the dissemination strategy will include deciding whether the partner organization may be named in print and whether specific partners will coauthor particular documents. My own view is that organizations should choose whether to be anonymous in publications, but not whether academics publish the findings. Lastly, I wish to note that ethical questions may also arise regarding human subjects (i.e., the design and implementation of the intervention). The key issue is that academics may apply different ethical standards to their work than their organizational partners. For instance, they may differ regarding the acceptability of deception and their ability/desire to obtain informed consent. They may also differ as to whether it is ethically defensible to study the impacts of interventions that have “major, direct, and possibly adverse effects on the lives of others” (Humphreys 2011, p. 1). They may
raise different questions surrounding the extraction of a control group that remains untreated. In these situations, researchers should strive to reduce risks and costs to subjects. They will also need to decide what ethical grounds, if any, justify their participation (Humphreys 2011, and Chapter 7 in this volume provide helpful guidance for such judgments).

11.6 So You Want to Partner! What Are the Steps?

Having provided a general discussion of the benefits and challenges of organizational partnerships, as well as several ethical considerations to keep in mind, I now describe the process in more detail. An overview appears in Box 11.1. Academics who want to collaborate typically start with some ideas about a topic and an eagerness to refine those ideas and have them challenged in conversations with a potential partner. As Penuel and Gallagher (2017, p. 36) state, “Each partner must be willing to have the aims of joint work at least partly shaped by the other partner.” With that in mind, the steps are as follows.

Box 11.1: Steps in an organizational partnership. (Note: some steps may occur concurrently, as noted in the text.)
1. Have an initial conversation with a potential partner.
2. The “dating phase” (ascertain partner’s willingness and capacity and discuss what an experiment would entail).
3. Put plans in writing.
4. Secure institutional review board approval.
5. Acquire funding (if necessary).
6. Collect data (including a pilot study if desired/feasible).
7. Analyze data and present results.
8. Follow up and possibly do another study together.


11.6.1 Step 1: Have an Initial Conversation with a Potential Partner

There is no single best way for potential partners to initially meet each other. Sometimes academics find potential partners via their own preexisting connections through family and friends, sometimes they are introduced via advisors and other colleagues, and sometimes they attend gatherings where they know that many practitioners will be in attendance (e.g., professional association meetings). They may also cold-contact organizations that they are interested in working with. New connections may also arise via social networks or via organizations like research4impact, Evidence in Governance and Politics (EGAP), the MIT GOV/LAB, and Scholars Strategy Network (SSN). Academics may also consider publishing op-eds about existing work, as these may lead practitioners to reach out and want to learn more. Overall, in my experience, initial conversations may be proposed by either academics or practitioners. If academics initiate contact, they should be clear and upfront about why they are specifically interested in working with that organization. Focus on its goals, values, and strategic priorities, along with how your interests, values, and skills align and could be useful. For academics, it is often too easy to frame initial conversations in terms that are most familiar – research questions rooted in the academic literature – and not in terms that are likely to resonate with practitioners. Resist doing so. During these initial conversations, the goal is not to overwhelm potential partners with lots of details about what a study could look like. Rather, the purpose is to establish rapport, learn as much as possible about the organization, and try to identify shared values that can underlie a partnership going forward. In these conversations, academics should adopt a relational mindset by using techniques that demonstrate interest in building a working relationship with potential partners. A relational mindset is important because it helps overcome two common problems that often arise in task-related conversations between people
with diverse forms of knowledge. One is self-censorship, in which the people we are speaking with do not feel comfortable sharing what they know and any concerns that they have (Galinsky et al. 2015; Stasser and Titus 2003). The second is that we may (automatically and unconsciously) enter these conversations with stereotypes about who is an “expert” with important knowledge to share. These status-based stereotypes mean that we may not equally recognize everyone’s task-relevant knowledge (in the United States, for example, those with less formal education, women, and racial minorities are often accorded lower status; Ridgeway 2001). Box 11.2 provides an overview of several relationship-building techniques that can help ease self-censorship and reduce the impact of status-based stereotypes.1 I discuss each of them below and in Step 2. First, use “openers” (Miller et al. 1983), in which you invite practitioners to talk about their organization’s history, mission, programs, goals, and previous experiences interacting with researchers/research institutions.

Box 11.2: Helpful relationship-building techniques.
• Use “openers.”
• Practice responsiveness.
• Be affirming.
• Use metacognitions.
• Engage in self-disclosure.
• Acknowledge over-time dimension to relationship.
• Use legitimation rhetoric.
• Provide reasons.
• Phrase questions in a way that avoids socially desirable answers.

1 To be sure, academics reading this chapter may also not feel comfortable sharing what they know, and they may also be the target of negative status-based stereotypes by potential partners. Although my discussion in this chapter is addressed to academic readers (i.e., What can they do to minimize self-censorship among practitioners? What can they do to reduce the impact of status-based stereotypes on their own judgments?), my hope is that all partners would employ these techniques as part of a relational mindset.


Take care to directly respond to what they say (Leary 2010) with respectful follow-up questions that reflect curiosity. One way to demonstrate curiosity is to use metacognitions (Petty et al. 1995), which entail asking people to reflect upon how and why they do what they do (“How did you decide to design the program that way?”). One way to demonstrate respect is to directly affirm what they say, rather than quickly judging it and/or trying to explain it away (Edmondson 1999). For example, suppose you are speaking with people from the National Audubon Society about climate change. They are likely to talk about climate change specifically in terms of its impact on birding and bird conservation. Being responsive in this case means tying responses directly to that concern (“I do not know much about the impact on birds in coastal climates. Please tell me more about that …”), rather than to more general considerations about climate change. It also means affirming the belief that the impact on birds is important, as opposed to rushing to the judgment that some other climate change impact should be the focus of the conversation. In addition, listen for emotional responses – “of confusion, concern, or excitement” (Penuel and Gallagher 2017, p. 41) – and pay attention to unfamiliar language and procedures. These are moments either to respond to right away or to refer back to later on, both for clarification and also to further demonstrate that you are responsive, affirming, and curious. Academics should also be prepared to clearly state what they personally want to learn from a partnership, along with relevant background details such as why they care about the topic, what led them to be interested in researching it, and why they are sympathetic to the practitioners’ mission. From a relational perspective, this type of self-disclosure helps establish both trust and liking, which make others more comfortable sharing their own personal information (Miller 2002). In short, a relational mindset entails being interested, not just interesting. Kindness,
respect, and actively demonstrating interest in and commitment to the organization’s work and its unique identity are vital. A relational mindset helps establish a level of equity in which all parties talk about, acknowledge, and value the knowledge that everyone brings to the table. Finally, here are two thoughts to keep in mind during the initial conversation. One is that it is helpful to get into the habit of keeping written records of communications (including summaries of phone conversations). These notes serve as important memory heuristics for everyone involved, and they also are useful in case of staff turnover or discrepancies down the line. The second is that, assuming the conversation is proceeding well, take care to explicitly signal that you wish to continue interacting (Clark and Lemay 2010). Signaling an over-time dimension may involve asking the partner for his/her preferred next steps, mentioning your own, and suggesting a particular timeframe.

11.6.2 Step 2: The “Dating Phase”: Ascertain Partner’s Willingness and Capacity and Discuss What an Experiment Would Entail

Ultimately, academics are looking for a partner who is both “willing and able” (Karlan and Appel 2016, p. 40). Ascertaining both of these attributes often involves lots of questions and many conversations. If the initial conversation from Step 1 seems promising, then follow-ups should delve more deeply into what a partnership might look like. A relational mindset remains vital, as there is still much to learn, talk about, and agree upon. Partner willingness refers to whether a partner genuinely wants to learn something new related to their programs and goals, knowing full well that the study may not turn up what they would hope. Typically, academics are able to ascertain this willingness naturally during the conversations, though there are two specific topics they will want to bring up. One is about what an experiment would actually involve (i.e., designing treatments, randomizing, recruiting a sufficiently large number of study participants, designating a
control group, etc.). During these conversations, academics may need to establish their credibility as a clear and confident advocate for good research design, as the technical details of experimentation may be unfamiliar to partners. Be prepared to potentially explain topics such as causal inference, statistics, internal validity, external validity, instrument design, attrition, spillover, blocking, and so forth in an intuitive, nontechnical manner that is tied to the partnership’s goals. From the perspective of a relational mindset, academics should also be prepared to explicitly provide reasons that justify and explain design decisions (Bastardi and Shafir 2000) – so, rather than saying, “We need to do x,” instead saying, “We need to do x because of reasons a, b, and c.” It is likely that many aspects of experimental design and procedures will raise concerns and questions. Given a relational mindset, hopefully partners feel comfortable raising them. Yet academics can also prompt them in several ways. One way to prompt sharing of concerns is to use legitimation rhetoric that acknowledges and validates the concerns they may have (Levine et al. 2019). One way to prompt questions is to ask them in an inviting manner. Instead of asking, “Is everything clear?” consider asking, “What questions do you have for me?” The latter phrasing signals that you expect the other individual to have questions, which is a reasonable assumption when discussing the technical details of experimentation with those who are unfamiliar with them. It also signals that a lack of clarity is entirely understandable. As conversations proceed (and often before a decision to partner is officially made), academics may get asked to provide an overview of a literature or other aspects of experimental design not unique to the specific study. Be prepared for the possibility of some kinds of “public service” along these lines. You will need to decide for yourself how much you are willing to do before an organization officially decides to partner. The other aspect of partner willingness refers to whether the partner is open to the possibility that the experiment reveals

something they view as unfavorable (such as a null result). Academics should raise this difficult possibility upfront. One way to do so is to talk about what an extended research program might look like. This signals that you are open to the possibility of a long-term relationship, which is helpful for a variety of substantive reasons (e.g., shortening the relationship-building steps for subsequent experiments). It also helps set expectations. If you decide to partner and then obtain an unexpected or unwanted result on the first experiment, then having spoken about a broader research agenda helps situate that result and the need to build on it together, rather than seeing it as the final word. Academics can also couch this discussion in terms of the importance of a “culture of testing,” which avoids a black-and-white “this works and this doesn’t” mindset. These conversations also provide useful moments to advocate for a pilot study (discussed further in Step 6). In addition to partner willingness, academics will need to ascertain organizational ability. This means assessing capacity to conduct an experiment. New research projects are typically not the place to develop entirely new programs. For example, if implementing the experiment will require an army of volunteers, then the partner should have a volunteer program already in place. Also along these lines, academics will want to ask about partners’ previous experience with data collection, recordkeeping, and partnering. They will want to make sure that the partner has experience working with the target population for the study – for instance, that they have access to an appropriate setting for testing the impact of the intervention at an appropriate time and one that is safe and technically feasible given the necessary infrastructure (working phone lines, Internet access, passable roads, etc.). When asking potentially sensitive questions like these about organizational capacity, it is helpful to phrase them in such a way that legitimizes less socially desirable responses (e.g., acknowledging that capacity may be lacking; Tourangeau et al. 2000). This is another aspect of a relational mindset that helps minimize self-censorship. So, for
helps minimize self-censorship. So, for example, rather than asking, “Please tell me about the staff that could help out,” instead ask, “Please tell me about the staff that could help out, as well as if you think that you may not have enough staff or volunteers and we’ll need to get more.” The former question implicitly signals that you expect there to be enough staff, whereas the latter question acknowledges that there may not be. Being mindful of how you ask questions like these is vital because it is nontrivial for practitioners to respond. They are probably not used to being peppered with questions like these from an “outsider,” and it takes time and energy to respond. Along these same lines, be mindful that partnerships often entail asking staff members and/or volunteers to do things they are not used to doing and are outside their job description. This may entail manually delivering the intervention, tracking subjects, auditing and entering data, and managing staff (Karlan and Appel 2016). That is why researchers should take care, as much as possible, to seek buy-in among organizational leaders, as well as among those who are on the front lines of implementation (at the very least, take care to explicitly acknowledge the extra/different workload and make sure that it is feasible). Another question about organizational ability relates to funding. Academics should inquire about whether outside funding is necessary and/or whether it is already in place (and, if so, what does the funder require?). If funding is not already in place, then who are the likely funders and how long may it take to secure funding? A final consideration related to organizational ability is that these conversations are likely to reveal constraints that affect what experimental designs are feasible. Academics may need to think creatively about how to design around them, perhaps by asking a different question, using standard tools in the experimental design toolkit, and so on. For example, in 2018, I conducted a study of civic leadership. Initial conversations with my organizational partner focused on trying to evaluate the impact of its preexisting
leadership training program, yet it became clear that we would be unable to randomize who attended. What we could randomize, however, was whether participants received additional mentoring after the training session. As a result, we shifted the question from one focused on evaluating the impact of the large training session to one focused on evaluating the impact of one-on-one mentorship.

Overall, my advice is that academics should be both enthusiastic and cautious during the “dating phase.” Again, the overall goals are to ascertain partner willingness and organizational ability. The back-and-forth that occurs can be long and entail uncertain payoffs. This is something that all academics, and especially untenured scholars, need to consider. That said, a good indication that conversations are moving toward a partnership is when both partners are willing to talk about specifics: what the intervention might look like, the context in which it will be delivered, each partner’s responsibilities for conducting the experiment, timing, budget, and so on. The opposite, which could be a lack of responsiveness in general (e.g., not promptly returning emails), an unwillingness to discuss specifics, and/or palpable differences in enthusiasm across levels of the organization, is worrisome. There is no clear line for when researchers should politely walk away from a potential partnership, but at the very least they should always be prepared to do so.

11.6.3 Step 3: Put Plans in Writing

If conversations reveal a mutually beneficial research question and feasible study design, then the next step is to codify everything in writing. Box 11.3 provides an overview of what should be written down. The goal is to lay out in very clear terms what will happen: outline the design of the study, how it will be implemented, the responsibilities of each partner throughout the process, how data will be collected, how results will be presented and disseminated, and when the partnership will end. Putting everything in writing helps ensure that everyone is on the same page
and that partners feel mutually accountable to each other. It also offers a reference point in case misunderstandings arise later on.2

Putting things in writing is a key make-or-break moment, as it can involve difficult conversations if you need to secure funding, resolve timeline differences, talk about who will have access to the data and in what form afterwards, and discuss safety concerns. While not typically written down, this step is also a valuable moment to talk about any infrastructure that might be necessary to make the partnership run smoothly (such as check-in routines, use of shared documents, and so on). These conversations are also vital in light of one of the challenges mentioned earlier: in the process of gaining approval on the organizational side, academics often learn more about who the relevant stakeholders are. Obtaining their approval adds time upfront, but also helps to build a coalition of support.

A key part of any written document will describe data ownership and dissemination plans. On the former, partners frequently decide that data collected are jointly owned
by both of them. That said, academics will need to ensure that they have the right to review and publish study details, data, and findings. On the latter, partners will also need to speak about dissemination plans (including data and write-ups). This includes the form that the write-ups will take (e.g., typically something other than an academic paper) and how and whether the partner’s name may be used in print (as well as any other identifying information). Practitioners are intensely mission-driven, and so understandably they are very concerned about how their movement and/or organization will be portrayed in print. Moreover, depending upon the nature of the work (e.g., if it involves studying electoral fraud, anti-corruption measures, democracy promotion, and so on), there is the possibility of political sensitivities that will also affect whether the organization wants its name used in print. For these reasons, I mentioned earlier in the ethics section that organizations should choose whether to be anonymous in publications (though not whether academics publish the findings).

Written partnership plans can take several different forms. Sometimes they can entail exchanging emails with relevant details and having all parties explicitly respond with their agreement. Other times they can involve more formal documents such as a memorandum of understanding (MOU) or perhaps a binding contract that is sent via university counsel (although MOUs carry a degree of seriousness and mutual respect, they are not legally binding).3 Academics and partners should decide together which type of document they prefer. Academics should also check with others at their university to see if it has any specific requirements. Regardless of the particular written form, the underlying point is the same: it is important to put the broad outlines of what the partnership will entail and the responsibilities of each partner in writing. Yet any document will not be the be-all and end-all. Many final decisions will come afterwards, and partners will face unforeseen circumstances.

2 Lipovsek and Zomer 2019 provide several examples of the types of questions that partners may wish to ask each other when putting plans in writing.

3 Organizations may also ask academic partners to sign a nondisclosure agreement.

Box 11.3: What should be put in writing ahead of time?
• Statement of each partner’s goals.
• Statement of each partner’s roles and responsibilities (treatment design, implementation, data collection, pilot study, etc.).
• Details on study funding and timing.
• Data ownership (including right to review and publish study details, data, and findings).
• Plan for dissemination of data, findings, and write-ups (including how/whether the organization’s name may be used in print).
• Process for ending partnership.
• Declaration of any conflicts of interest.

That is why communication lines must be open. The relational mindset described earlier is helpful for when unforeseen circumstances do occur, so that partners feel comfortable raising questions and concerns and have had practice being responsive to each other.

Lastly, in addition to written partnership plans, this is also the point in the process when academics will want to file preanalysis plans describing the hypotheses they plan to test and how the data will be analyzed. Preanalysis plans offer an additional form of commitment and expectation setting, among other benefits.

11.6.4 Step 4: Secure Institutional Review Board Approval

Parts of Steps 3, 4, 5, and 6 are likely to occur in tandem rather than sequentially. The institutional review board (IRB) process is unlikely to be unique to organizational partnerships per se, though it is possible that some university IRBs will have particular follow-up questions about the organization itself. For instance, they may ask about its goals and tax status4 (e.g., to ensure that university funds are not being used for a research project that will directly benefit a partisan organization). Lastly, note that sometimes university IRBs decide they do not need to review research proposals if the organization is collecting the data as part of its mission. Nevertheless, my advice is for academics to always request IRB approval just in case.

4 This point is especially relevant when partnering with nonprofits, as some are partisan and some are nonpartisan. A detailed discussion of the differences between the various types of nonprofits (501c3s, 501c4s, political action committees, etc.) is beyond the scope of this chapter, but the following website provides a brief overview as a good starting point: www.opensecrets.org/527s/types.php.

11.6.5 Step 5: Acquire Funding (If Necessary)

As noted above, conversations about funding should start well before Step 5, and in particular well before the decision to move forward and put everything in writing. That said, I include it as Step 5 because formal funding applications (if needed) may only arise once partners have officially decided to work together. While some partnerships may require new external grants and substantial funding, it is important not to overstate this point. There is often a misconception that experiments with organizational partners involve substantial expenses incurred by researchers, which can deter those who are just starting out. Yet that need not be the case, and often is not, for two reasons. First, some experiments do not require any new out-of-pocket expenditures at all. Instead, they may just require a small change in organizational procedure, such as randomizing something that was not previously being randomized. Second, for experiments that do involve new out-of-pocket expenditures, the organizational partner may already have a grant that covers research expenses. For example, I conducted experiments with four different organizations between 2016 and 2018. All of these cost $0 from my research account. Two experiments involved randomly assigning something that had not been randomly assigned in the past. In the other cases, the organizations had preexisting grants for research expenses. To be sure, I have also conducted experiments in which I have spent some money from my personal research budget or applied for small grants on my own to cover expenses, but that has definitely not always been the case.

11.6.6 Step 6: Collect Data (Including a Pilot Study If Desired/Feasible)

Academics need to be actively involved during the implementation and data collection phase. One key aspect of this concerns the randomization. Speaking from personal experience, it is easy for randomization to proceed incorrectly. If at all possible, academics should try to conduct the randomization themselves and provide the implementing partner with a list of who receives the control and treatment. Another important consideration is whether to conduct a pilot study (and, just like with funding discussions, this should
be discussed when writing up partnership plans, if not before). Pilot studies (along with spending time in the field ahead of time) are valuable ways to learn about the context, test the feasibility of particular treatments and instruments, and spot problems early on. Sometimes pilot studies can involve many iterations. For instance, I was involved with a door-to-door canvassing experiment that entailed pilot testing various treatments for a total of six months before the final design was agreed upon. In this case, the pilot studies were valuable not only for research purposes, but also for organizational capacity building, as my partner’s staff were heavily involved with designing the pilot and ascertaining feasibility (i.e., what kinds of treatments its volunteers were able to deliver on voters’ doorsteps). Ultimately, the design of the intervention very much reflected their extensive locally rooted expertise along with knowledge drawn from the academic literature.

Prior to the full-study implementation, academics must remain mindful that partnerships often entail asking frontline staff, volunteers, and supervisors to engage in new tasks that are not part of their core job descriptions (to echo a point I mentioned earlier when discussing organizational ability). They will need to be clear, and make sure that others within the organization are clear, about why the study must be implemented in a certain way. Keep in mind that from the perspective of organizational staff (and possibly some leadership as well) academics are “outsiders,” and it is likely that at least some people will see the research project as being run by “outside experts.” In addition to having explicit support from organizational leadership, researchers can establish credibility by making sure that every aspect of the rationale for the implementation is made clear. People are more likely to voluntarily comply with a request when they receive reasons for it (Langer et al. 1978).5

5 In some cases, partners may consider more concrete incentives for staff and volunteers as well.

As implementation proceeds, partners should be in constant communication with updates in order to ensure that matters are proceeding smoothly and data are being recorded consistently, completely, and accurately. Academics are used to thinking about technical failures that can arise with research designs (e.g., insufficient statistical power, poorly worded survey instruments, attrition, noncompliance), yet with partnerships there are many implementation challenges that may arise as well (e.g., staff and volunteers not following protocols correctly; for a detailed overview, see Karlan and Appel 2016, chapters 4 and 5). Be sure to maintain a relational mindset by, for example, phrasing check-in questions in ways that invite concerns to be raised and are not accusatory (as noted earlier in Step 2).

Lastly, before data collection closes, partners may ask for updates and/or they may have directly observed field successes and challenges. It is possible to become discouraged at this point, which can lead to disengagement (or worse). Academics cannot wholly avoid this, yet that is why, during Steps 2 and 3, academics should talk openly about the possibility that the results may not be as expected. This point underscores how academics need to work to ascertain whether the partner is genuinely willing to learn something new and possibly unexpected by completing the experiment.
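To make the earlier advice about handling the randomization yourself concrete, the short sketch below shows one way a researcher might generate a two-arm assignment list to hand to an implementing partner. It is only an illustration: the file names, column names, and use of Python are assumptions for the example, not details from this chapter.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(20180101)  # fixed seed so the assignment list can be re-created and audited

# Hypothetical roster supplied by the partner (file and column names are placeholders).
roster = pd.read_csv("partner_roster.csv")            # e.g., columns: participant_id, site
n = len(roster)
n_treat = n // 2

# Complete random assignment: exactly n_treat participants receive the treatment.
labels = np.array(["treatment"] * n_treat + ["control"] * (n - n_treat))
roster["condition"] = rng.permutation(labels)

# Share only what the partner needs in order to deliver (or withhold) the intervention.
roster[["participant_id", "condition"]].to_csv("assignment_list.csv", index=False)
```

Keeping the seed and the script alongside the shared list means the assignment can be reproduced later if questions arise about whether the randomization was carried out correctly.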

11.6.7 Step 7: Analyze Data and Present Results

Once data collection is complete, the next step is the analysis and write-up. Sometimes academics do the analysis on their own, whereas other times partners work together. Either way, how those results will be disseminated should have been discussed earlier in the process (see Step 3). At least initially, practitioners often want a short presentation, memo, or policy brief. They are happy to cite a peer-reviewed paper later on, but may not want to wait for it. And in any event, they often value something that is shorter and more focused on the takeaway message of “what works,” stripped of the formality associated with situating results in an existing academic literature.

Academics should again be prepared for a variety of reactions to the findings. There may be different levels of emotional investment in the project, especially if it involves an impact assessment (i.e., if it involves directly evaluating the impact of an organization’s existing program, which has direct implications for people’s jobs and livelihoods). This is why difficult conversations about unexpected findings are vital during earlier steps in the process.
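For the kind of short, takeaway-focused summary described above, the quantity partners usually care about is the simple difference between groups. The sketch below shows one way that bottom-line number might be computed for a binary outcome; the counts are invented for illustration and are not drawn from any study discussed in this chapter.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Invented example: number of people who took the desired action in each arm.
treat_yes, treat_n = 62, 1500
control_yes, control_n = 38, 1500

p_t, p_c = treat_yes / treat_n, control_yes / control_n
diff = p_t - p_c                                              # difference in response rates
se = np.sqrt(p_t * (1 - p_t) / treat_n + p_c * (1 - p_c) / control_n)
z, p = proportions_ztest([treat_yes, control_yes], [treat_n, control_n])

print(f"Treatment raised the response rate by {diff:.1%} "
      f"(approximate 95% CI {diff - 1.96 * se:.1%} to {diff + 1.96 * se:.1%}; p = {p:.3f})")
```

A one-page memo built around a figure like that difference, plus a sentence on its uncertainty, is often more useful to a partner than the full set of tables that ends up in the academic write-up.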

11.6.8 Step 8: Follow Up and Possibly Do Another Study Together

Continue talking about the data and results, as interaction helps both partners collectively make sense of them. Unless there has been great staff turnover, one main benefit of continuing to work together is that relationships are already in place. It is also often easier to implement longer-term experiments with preexisting partners.

11.7 Detailed Example: An Organizational Partnership to Study Donation Decisions

Having discussed the benefits, challenges, and steps of organizational partnerships in general, in this section I describe one example at length. I discuss several specific details of how the partnership arose and identify broader themes that the example illustrates. This example is not necessarily the most representative, but I choose to focus on it because it was my first one. My hope is that reading about the origin story of my first partnership will be especially useful for readers who are brand new to this kind of work.

In the fall of 2011, I was studying why it is difficult to organize people facing economic insecurity. Based on the existing literature, I suspected that people would pay more attention when it is clear how an organization’s work connects to their own personal situation, but at the same time I theorized that some of the common ways that organizations personalize issues might actually be self-undermining. For example, using language
about the increasing cost of healthcare might successfully personalize the issue, yet it might also make people feel poor and thus less likely to believe that they can afford to spend money (and even time) on politics. I had already tested this idea via survey experiments, but the context was somewhat artificial for studying action-taking, and I was interested in shortening the distance between the research design and the behaviors I aimed to learn about (see Chapter 12 in this volume on this point). Thus, I wanted to conduct an experiment in a more naturalistic environment, such as giving people the opportunity to take action supporting a real organization working to reduce economic insecurity. While I was aware of many organizations working on a variety of economic insecurity issues, I faced constraints that are common for people who are new to partnerships: I did not have any preexisting relationships with staff at these organizations and I was worried about how long it might take to actually collect data. I was hoping to field this experiment by spring 2012 in order to remain on track with a book manuscript that had a sensitive deadline (in this case, largely stemming from my tenure clock). Given these constraints, I believed that working with a small local nonprofit would be best, as I thought it would be easier to gain access to decision-makers. Given the lack of a large bureaucracy, I also hoped that they might be more amenable to a quick timeline. That said, two potential challenges with small nonprofits are that they are less likely to have preexisting grants that could cover research costs and staff are likely to be stretched especially thin. I started asking friends and colleagues in my small city (Ithaca, NY) if they were involved with any local organizations that they thought might be interested. After several conversations, one friend suggested I contact the Ithaca Health Alliance (IHA), a small local nonprofit that provides health care services and conducts community engagement on health issues. She thought its leadership might be interested because, like many others, they sometimes used personalized language about economic insecurity
in order to build their base of support. I was hoping that they might be interested in working together to study whether this rhetoric was unintentionally harmful (and, if so, what alternatives would be better). Given IHA’s governance structure, she told me that I would likely need approval from both the executive director and the president of the Board of Directors. The executive director was busy with grant proposals and day-to-day responsibilities, and the Board president was a volunteer with a separate full-time job. Fortunately, I was able to schedule brief meetings with them by January 2012.

Our initial conversations were very much getting-to-know-you affairs, focused on personal interests, goals, and values and what the benefit exchange of a partnership would look like. They were also especially interested in knowing about some of the existing literature on this topic. Logistics came later on, and I knew that the “dating phase” was progressing well when they invited me to draft a short written proposal to review. I proposed a very simple two-group experiment, with one control group and one treatment group, in which we would send donation solicitation letters with varying language to potential new supporters.

During the “dating phase,” they raised several questions. For example, after I shared my survey-based findings on self-undermining economic insecurity rhetoric, they asked why another study was even necessary. In response, I communicated why I thought it was important to study this question in a more natural setting. Plus, I took care to highlight the obvious benefit: IHA clearly stood to gain from anyone who responded to our solicitation letters, and I was very clear that any language we used in the letters would have to be approved by all parties (thus, while I began our initial conversations with a broad idea, the final study design was a product of everyone’s input). In the process, I also learned a lot about how the organization worked – to what extent they relied on individual donors, how that had changed over time, and why avoiding self-undermining rhetoric was important to them.

Another set of conversations focused on resources. There was no ready-made email list we could freely use for this study. Instead, we would need to do cold mailings with paid postage. I calculated that we would need to send out approximately 3000 of these letters. Via small grants and some money from my preexisting research account, I was able to contribute $2000, which covered the large majority of the costs. IHA had a very small budget, but was willing to pitch in to cover some remaining costs (such as letterhead and envelopes) and commit a small amount of staff time. They were worried about devoting scarce volunteer hours to this study, and so I agreed to do all of the envelope stuffing, stamping, and sealing myself. Lastly, although we discussed the idea of a pilot study, we decided against it given that: (1) the solicitation language was (for the most part) fairly straightforward and (2) we did not expect implementation challenges associated with simply mailing letters and completing standard data entry as responses came in. Ultimately, the executive director and Board president agreed that partnering was worthwhile, and the study was fielded in March 2012. We found that rhetoric focusing on skyrocketing healthcare costs was indeed self-undermining. The data were high quality, and observing behavior in the real world was both exciting and directly relevant to practice. I published the results as part of a book on the politics of economic insecurity (Levine 2015). Meanwhile, IHA gained many new supporters. I also created a separate memo and talk to present to the IHA leadership, per our MOU. I then continued to be in contact with them after the study was over to discuss other possible solicitation strategies, and as it turned out, we partnered on two other experiments. Stepping back from the specific details, this example underscores several attributes that were helpful with moving the partnership forward without lengthy delay: an organization without a large bureaucratic structure, an organization with supportive decision-makers, and a study that did not require new fundraising. At the same time, two attributes arguably added some time
and uncertainty during the “dating phase”: having to build new working relationships from scratch and ensuring a benefit exchange (i.e., ensuring the design was theoretically meaningful and could likely pass peer review and also ensuring it was consistent with the organization’s existing outreach and goals). Overall, the result was an experiment that was a nice example of use-inspired research that advances fundamental understanding (Stokes 2011).
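As a rough illustration of the kind of back-of-the-envelope calculation that might lie behind a mailing count like the roughly 3000 letters mentioned above, a researcher could run a standard power analysis for comparing two response rates. The response rates, power target, and library used below are illustrative assumptions, not figures from the IHA study.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Invented planning numbers: a 1.5% response rate under the control letter
# versus 3% under the treatment letter, 80% power, two-sided alpha of 0.05.
p_control, p_treatment = 0.015, 0.03
h = proportion_effectsize(p_treatment, p_control)      # Cohen's h for two proportions

n_per_arm = NormalIndPower().solve_power(effect_size=h, alpha=0.05, power=0.80,
                                         ratio=1.0, alternative="two-sided")
print(f"Letters needed: about {2 * round(n_per_arm):,} in total")  # roughly 1,500 under these assumptions
```

Assuming a smaller difference between the letters pushes the required mailing into the low thousands, which is why cold mailings of this kind tend to require counts in that range.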

11.8 Conclusion

This chapter has provided an approach and a procedural toolkit. The approach underscores the importance of not only adopting a learning mindset when engaging in organizational partnerships, but also a relational mindset that reflects the fact that you are building new working relationships with individuals who have diverse knowledge. This mindset is woven into the step-by-step guide to partnering. Although organizational partnerships certainly entail some challenges, they also offer an exciting opportunity to learn together and to study important behaviors in the real world.

References

Allen, Danielle. 2016. “Toward a Connected Society.” In Our Compelling Interests, eds. Earl Lewis and Nancy Cantor. Princeton, NJ: Princeton University Press, pp. 71–105.
Bastardi, Anthony, and Eldar Shafir. 2000. “Nonconsequential Reasoning and Its Consequences.” Current Directions in Psychological Science 9: 216–219.
Baumeister, Roy F., and Mark R. Leary. 1995. “The Need to Belong: Desire for Interpersonal Attachments as a Fundamental Human Motivation.” Psychological Bulletin 117: 497–529.
Butler, Daniel M. 2019. “Facilitating Field Experiments at the Subnational Level.” Journal of Politics 81: 371–376.
Chambliss, E. L., and Bruce V. Lewenstein. 2012. “Establishing a Climate Change Information Source Addressing Local Aspects of a Global Issue: A Case Study in New York State.” Journal of Science Communication 11: 1–7.
Clark, Margaret S., and Edward P. Lemay. 2010. “Close Relationships.” In Handbook of Social Psychology, eds. Susan T. Fiske, Daniel T. Gilbert, and Gardner Lindzey. Vol. 2. Hoboken, NJ: John Wiley & Sons, Inc., pp. 898–940.
Coburn, Cynthia E., and William R. Penuel. 2016. “Research-Practice Partnerships in Education: Outcomes, Dynamics, and Open Questions.” Educational Researcher 45: 48–54.
Dobbins, Maureen, Steven E. Hanna, Donna Ciliska, Steve Manske, Roy Cameron, Shawna L. Mercer, Linda O’Mara, Kara DeCorby, and Paula Robeson. 2009. “A Randomized Controlled Trial Evaluating the Impact of Knowledge Translation and Exchange Strategies.” Implementation Science 4: 61.
Druckman, Daniel. 2000. “The Social Scientist as Consultant.” American Behavioral Scientist 43: 1565–1577.
Druckman, James N., and Arthur Lupia. 2017. “Using Frames to Make Scientific Communication More Effective.” In The Oxford Handbook of the Science of Science Communication, eds. Kathleen Hall Jamieson, Dan M. Kahan, and Dietram A. Scheufele. New York: Oxford University Press, pp. 351–360.
Edmondson, Amy. 1999. “Psychological Safety and Learning Behavior in Work Teams.” Administrative Science Quarterly 44: 350–383.
Fiske, Susan T., and Cydney Dupree. 2014. “Gaining Trust as well as Respect in Communicating to Motivated Audiences about Science Topics.” Proceedings of the National Academy of Sciences of the United States of America 111: 13593–13597.
George, Alexander L. 1993. Bridging the Gap: Theory & Practice in Foreign Policy. Washington, DC: US Institute of Peace Press.
Humphreys, Macartan. 2011. “Ethical Challenges of Embedded Experimentation.” Comparative Democratization 9.
Karlan, Dean, and Jacob Appel. 2016. Failing in the Field: What We Can Learn When Field Research Goes Wrong. Princeton, NJ: Princeton University Press.
Leary, Mark R. 2010. “Affiliation, Acceptance, and Belonging: The Pursuit of Interpersonal Connection.” In Handbook of Social Psychology, eds. Susan T. Fiske, Daniel T. Gilbert, and Gardner Lindzey. Vol. 2. Hoboken, NJ: John Wiley & Sons, Inc., pp. 864–987.
Levine, Adam Seth. 2015. American Insecurity. Princeton, NJ: Princeton University Press.
Levine, Adam Seth. 2020. “Research Impact Through Matchmaking (RITM): How and Why to Connect Researchers and Practitioners.” PS: Political Science & Politics 53: 265–269.
Levine, Adam Seth, John Kotcher, Neil Stenhouse, and Edward Maibach. 2019. “Legitimizing Nervousness Motivates People to Take Risky Political Actions.” Working Paper.
Lipovsek, Varja, and Alisa Zomer. 2019. “How to Have Difficult Conversations: A Practical Guide for Academic-Practitioner Research Collaborations.” Cambridge, MA: MIT GOV/LAB.
Lupia, Arthur, and Mathew D. McCubbins. 1998. The Democratic Dilemma. Cambridge, UK: Cambridge University Press.
Miller, Lynn Carol, John H. Berg, and Richard L. Archer. 1983. “Openers: Individuals Who Elicit Intimate Self-Disclosure.” Journal of Personality and Social Psychology 44: 1234–1244.
Miller, Norman. 2002. “Personalization and the Promise of Contact Theory.” Journal of Social Issues 58: 387–410.
National Research Council. 2012. Using Science as Evidence in Public Policy. Washington, DC: National Academies Press.
Nutley, Sandra M., Isabel Walter, and Huw T. O. Davies. 2007. Using Evidence: How Research Can Inform Public Services. Bristol: The Policy Press.
Penuel, William R., and Daniel J. Gallagher. 2017. Creating Research–Practice Partnerships in Education. Cambridge, MA: Harvard Education Press.
Stokes, Donald E. 2011. Pasteur’s Quadrant: Basic Science and Technological Innovation. Washington, DC: Brookings Institution Press.
Tourangeau, Roger, Lance J. Rips, and Kenneth Rasinski. 2000. The Psychology of Survey Response. Cambridge, UK: Cambridge University Press.
Watts, Duncan J. 2017. “Should Social Science Be More Solution-Oriented?” Nature Human Behaviour 1: 1–5.

Part III

Experimental Treatments and Measures

Chapter 12

Improving Experimental Treatments in Political Science∗

Diana C. Mutz

Abstract

This chapter examines experimental treatments and the theoretical, practical, and empirical issues involved in their implementation. I begin by discussing the underlying purpose of experimental treatments. Second, I address what it means to say that a treatment has generalizable effects. Third, I discuss practical issues involved in constructing treatments in a variety of contexts including written, spoken, visual, and behavioral interventions. In the fourth section, I highlight the importance of validating that experimental treatments have induced the intended differences by experimental condition in the independent variable. I point to the general neglect of manipulation checks in experiments in political science and emphasize what can be learned through their inclusion. Contemporary publications provide some evidence of confusion among political scientists about the purposes for which manipulation checks and attention checks are appropriate. In the fifth and final section, I highlight the need for political scientists to move beyond between-subject assignment of treatments to consider far more powerful within-subject and hybrid experimental treatments.

* The author would like to thank Tyler Leigh of the University of Pennsylvania for his research assistance with this chapter.

Experimentation has proliferated in political science. At the same time, enthusiasm for this method has come with some confusion surrounding the central purpose of these studies. Scholars choose the experimental method because they want to establish a
causal relationship between two concepts. Internal validity remains the raison d’être for social science experiments. Experimental treatments, also known as manipulations or interventions, are integral to the execution of experiments, yet they have received little attention in their own right. Practices within political science to date offer tremendous room for improvement in
the quality of treatments used in experimental research. To improve on current practice, I suggest underutilized, alternative ways of administering treatments that can improve the validity of interventions and add to the power and efficiency of experimental designs. My goal is to highlight what makes a high-quality treatment, how to validate it as such, and to offer suggestions on how to go about creating or selecting experimental treatments.

12.1 The Goal of Experimental Treatment

The goal of experimental treatment is to create variation in the independent variable in the direction (or directions) intended by the researcher. If one seeks to induce anxiety, for example, in order to examine its effects, then one experimental group must be induced to have greater anxiety than the other. Importantly, that is all that the experimental treatment must accomplish. It is not necessary that it mimic the real-world event that might trigger increased anxiety. Moreover, it is not necessary that people in the real world frequently encounter this treatment or experience this particular reason for a change in anxiety. What is most important about a treatment is that it systematically and substantially change the independent variable in the intended direction.

To put this more stylistically, experiments are designed to answer the question, “If x changes, how should y be expected to change?” This is a fundamentally different question from whether the study’s treatment influences x (i.e., whether there was a substantive and statistically significant manipulation of the independent variable). In fact, most experimental textbooks advise pilot testing treatments in advance to make sure they alter the independent variable well before launching an experiment to test a hypothesis about a dependent variable (e.g., Chiang et al. 2015; McLeod 2017). A pilot study is an initial version of an experiment executed on a smaller sample, with the goal of
saving the researcher time and money. Pilot tests are used to make sure the manipulations are working as intended, that there are no floor or ceiling effects in the measurement of key concepts, that participant instructions are fully understood, and so forth. Although experiments can be designed to address external validity in various ways, outside the context of field experiments, treatments themselves are not typically designed to enhance external validity. The external validity of a theory is obviously important, but this is conceptually and empirically distinct from how one goes about manipulating the independent variable. Further, an experiment is not intended to provide estimates of public attentiveness to or the frequency with which people experience a treatment in the real world. Treatments are solely about altering the independent variable in a specified direction. Those randomly assigned to have their levels of some independent variable raised or lowered must, on average, be induced to change.

At times, the treatment’s effect on the independent variable and the independent variable’s effect on the dependent variable become conflated in researchers’ minds, thus producing confusion about what is essential. It is important not to confuse an experiment examining whether x causes y with a mediation hypothesis. If the point of an experiment is to hypothesize and test a whole causal chain – that is, both that x causes y and that y causes z – then there are multiple independent variables and other important considerations (see Chapter 14 in this volume). A treatment is chosen for its capacity to influence the independent variable, but it is not generally one and the same except in applied contexts. For example, if one wanted to know whether a particular advertisement changed opinions, exposure to that particular ad would be both the treatment and the independent variable. In more theoretical experiments, the two are more distinct. For example, if one used an advertisement as a means of increasing anxiety, which was hypothesized to produce greater intent to vote, exposure to the ad would be the treatment, but knowing whether
it successfully enhanced anxiety would be a necessary condition for testing the hypothesis (i.e., the independent variable is anxiety). Because manipulating the independent variable is the central purpose of experimental treatments, the way in which one goes about it is less important than the extent to which, and the precision with which, the treatment is accomplished. Ideally, one manipulates the independent variable only and nothing else that might affect the dependent variable. According to textbooks on experimentation, one of the most common problems when executing experimental studies is weak and ineffective treatments and thus failure to manipulate the independent variable successfully (e.g., Bhattacherjee 2012; Coleman and Montgomery 1993; see also www.psychwiki.com/wiki/Tips_On_Conducting_Experiments#Pilot_Testing). In my personal experiences teaching experimental design to both undergraduate and graduate students, over 90% of new experimental researchers initially underestimate what it takes to induce substantial variation in the independent variable; changing people’s knowledge, motivations, assumptions, and so forth is often more difficult than it seems. For this reason, researchers should prioritize experimental treatments that create large as well as statistically significant differences between conditions in the independent variable.

Hesitancy about using hard-hitting treatments often comes in the form of an appeal to the so-called “real world” or mundane realism (e.g., Kreps and Roblin 2019). What if that particular form of treatment seldom happens in the real world or few people encounter it? If the independent variable seldom varies in the real world, then this concern has merit; but if it simply varies in response to other treatments in the real world, then this is not problematic. Manipulation of the independent variable in an experiment need not occur by means of something that also happens regularly in the real world. In fact, it matters very little how one goes about altering the independent variable so long as it works, and so long
as the researcher manipulates strictly the independent variable he/she has in mind and nothing more. To put this more succinctly, treatments do not require mundane realism. If it were possible, it would be perfectly fine to manipulate the independent variable with a brain probe to evaluate its effects on the dependent variable. After all, the researcher is not claiming that people are experiencing brain probes in the real world, only that if that independent variable changes, then it produces the hypothesized consequences for the dependent variable. (As it turns out, a brain probe can, indeed, be used to manipulate generosity; see Christov-Moore et al. 2017.) The issue of whether something that happens in the real world regularly alters levels of the independent variable is a wholly different question.

For example, in one of my earlier experiments, my goal was to manipulate generalized social trust in both positive and negative directions in order to examine its effects on economic behavior (Mutz 2005). I reasoned that if observational studies are correct in suggesting that high levels of social trust facilitate what are perceived to be risky economic exchanges, then changing people’s levels of social trust should increase the probability of online buying, whereas lowering levels of social trust should do the opposite. Based on pilot tests, the most effective treatment for inducing lower and higher levels of social trust ended up being a Reader’s Digest article about a study in which wallets full of cash were placed in various cities to see how many of them were returned to their owners, whose identification was included in the wallets. In this story, hidden observers describe watching as people found the wallets and decided how to proceed. Some furtively looked over their shoulders to make sure no one was looking while they absconded with the cash; others immediately tried to call the owner to return it. By editing out either the untrustworthy or trust-enhancing examples, I isolated the most effective material for altering levels of social trust in both positive and negative directions. Over a month after exposure to treatment, this sample of people who were prescreened
to never have purchased anything online before were interviewed about their online purchases in the last month, among other kinds of economic decisions. Those assigned to the high social trust conditions were systematically more likely to have made their first online purchase than those in the control condition, and those in the low social trust condition were less likely than those in the control condition to have made their first online purchase (Mutz 2005). By choosing this particular treatment, I was not suggesting that many people still read and have their attitudes influenced by Reader’s Digest, nor was I suggesting that magazine articles were an important determinant of people’s social trust levels. Instead, I was arguing that when social trust increases or decreases for whatever reason, it has consequences for people’s economic behavior. The observational social trust literature tells us that many life events affect levels of social trust. For example, divorce – an exceedingly common event – lowers levels of social trust, and so does being victimized by a crime. The Reader’s Digest article was merely a means to an end – that of altering levels of social trust. Variation in the independent variable is a common phenomenon in the real world, even if this particular means of inducing it is not. To reiterate, the way in which the independent variable is manipulated is not the point; in fact, if a theory is any good, it should not matter how we implement treatment, so long as we do so effectively and in a way that creates the intended variation in our causal concept. Far too often the risk is that the treatment is simply too weak to demonstrate effects on the independent variable, let alone the dependent variable.

12.2 Generalizability of Treatments

This brings me to a second common source of confusion: that is, what it means to say an experimental treatment is generalizable. In political science, experimental scholars spend a lot of time and energy emphasizing the
generalizability of results from nonstudent or in some cases representative probability samples (see Chapters 9 and 21 in this volume). However, the generalizability of subject characteristics is only one dimension of the generalizability of an experiment. Interestingly, political scientists almost never address the generalizability of treatments (see Shadish et al. 2002). Instead, when it comes to evaluating generalizability, it is common for people to conflate generalizability with mundane realism, complaining, for example, that a specific operationalization of a treatment is not likely to be found in the real world. Generalizability is not a function of how often a specific treatment happens in the real world; it refers instead to whether other treatments that systematically vary the same independent variable will produce similar effects on the dependent variable. In other words, drawing on the example above, it should not matter what kind of treatment is used to alter people’s levels of social trust – the same consequences should ensue. Generalizability of treatments means not how often this particular stimulus happens in the real world, but whether the same effects occur when that same independent variable is altered in alternative ways, using different forms of treatment. For those engaged in a series of experiments on a given topic, this means it is important to induce variation in the independent variable in multiple ways rather than relying on the exact same operationalization of treatment for study after study. Confusion about the purpose of treatment is particularly problematic when a treatment effectively represents a whole category of possible operationalizations of treatment. For example, if one wanted to draw conclusions about the effects of negative ads on intent to vote or the impact of competitive reality shows on economic attitudes, this issue would arise. The researcher in such a situation has two choices. He or she could select one particular instantiation meeting the necessary and sufficient conditions for “negative ad” or “competitive reality show” and see whether effects occurred from this treatment. However, many would wonder
whether the result had something to do with this one particular treatment. Alternatively, one could sample from a pool of qualifying treatments to enhance generalizability. To use this approach, particularly with a complex stimulus such as an advertisement or a television program, one must first carefully define what kinds of competitive reality shows are appropriate for the purposes of this theory. In other words, what is the construct of interest for the purposes of this study? For example, Kim (2019) wanted to study the impact of programs that (1) featured ordinary, noncelebrity participants, who (2) worked hard in competing with one another for (3) some kind of prize of personal economic value. Her hypothesis was that these programs served as modern-day Horatio Alger stories, emphasizing that any ordinary person who simply worked hard could get ahead. As a result, these programs were hypothesized to reinforce beliefs in the American Dream and buttress support for meritocratic economic policies despite growing inequality.

Once Kim had carefully defined the kind of reality show that counted for the purposes of her theory, she identified a large set of qualifying programs from the real world of American television. For the purposes of her experiments, subjects assigned to the treatment condition viewed a randomly selected one of many qualifying programs (e.g., Shark Tank, Toy Box, Master Chef, America’s Got Talent), while the control condition subjects viewed another reality show that did not meet these qualifications (e.g., Cesar 911). Using this strategy, the results of the study do not depend on how subjects responded to any one program (which might be a unique or odd example), but instead on how people generally respond to that category of program. This approach is extremely useful when researchers want to test hypotheses about constructs that can take many different forms. For example, conclusions about emotionally arousing persuasive messages should not rest on how people respond to one particular emotionally arousing persuasive message. Instead, a
well-defined construct can be represented by a sample of messages all representing that same construct. This approach might seem to violate the requirement that each condition differ in only one, highly controlled characteristic, since there are obviously many possible instantiations of an emotionally arousing message. But if the theory is correct, then the effect should not depend on which treatment is chosen to manipulate the construct of theoretical interest in a particular study. In the example of the reality show treatment, if treatment had been operationalized as exposure to only one (potentially unrepresentative) specific program of this kind, it would be difficult to generalize the study’s findings. On the other hand, if, despite all of the extraneous noise across treatments, the anticipated pattern of results is observed even though different subjects see different programs meeting the program definition, this is in practice a more generalizable finding than one would obtain by using any specific (and potentially unique) example from the broader category. To summarize, the most important characteristic of an experimental treatment is that it induces substantial and observable variation in the independent variable in the directions intended. Intuition is seldom a reliable guide to judging whether or not this has occurred. For this reason, I turn next to how one evaluates experimental treatments.
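Before turning to manipulation checks, here is a minimal sketch of the stimulus-sampling strategy just described: each subject is first randomized to a condition and then shown a program drawn at random from the pool defined for that condition. The pools below simply reuse the program titles mentioned in the Kim (2019) example; the per-subject coin flip and the Python implementation are illustrative assumptions rather than details of that study.

```python
import random

# Pools of qualifying (treatment) and non-qualifying (control) programs,
# taken from the examples mentioned in the text.
TREATMENT_POOL = ["Shark Tank", "Toy Box", "Master Chef", "America's Got Talent"]
CONTROL_POOL = ["Cesar 911"]

def assign_stimulus(rng: random.Random) -> dict:
    """Assign a condition, then sample a concrete stimulus within that condition."""
    condition = rng.choice(["treatment", "control"])
    pool = TREATMENT_POOL if condition == "treatment" else CONTROL_POOL
    return {"condition": condition, "program": rng.choice(pool)}

rng = random.Random(2019)                      # seeded so the assignment is reproducible
assignments = [assign_stimulus(rng) for _ in range(1200)]
```

Because the treatment-group estimate averages over the whole pool rather than any single program, the design speaks to the category of stimulus rather than to one potentially idiosyncratic example.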

12.3 Manipulation Checks: Evaluating Experimental Treatments

The purpose of a manipulation check is to validate that a treatment has, in fact, induced the intended change in the independent variable in the specific context in which the experiment is administered. If one does not know with confidence that this occurred, then one cannot produce empirical support either in favor of, or disconfirming, a theory. Manipulation checks consist of operationalizations of the independent variable(s). The form they take will differ based on the nature of the independent variable, but they
are an essential part of most experimental designs that are given short shrift in many studies. Without successful manipulation checks, the results of many experimental studies are of little value. For example, even if a drug trial showed positive effects when comparing treatment and control groups, if researchers were uncertain whether those in the treatment group regularly took the drug or whether those in the control group did not take the drug, few would have confidence in the results. In this particular example, taking the drug (or not) is the treatment, so the manipulation check is the same as compliance with the treatment assigned. In many social science experiments, however, the treatment and the independent variable it is designed to vary are not one and the same, so it is important that the manipulation check measures be consistent with the independent variable. For example, if the independent variable is anger, and subjects in the treatment condition are insulted in order to trigger this response, it is not enough to confirm that treatment subjects recall being insulted to a greater extent than control subjects. What must be confirmed is a higher level of experienced anger.

My general point is that an interpretation of a change in the dependent variable without documentation of a change in the independent variable is open to speculation. Change has obviously occurred, but we do not know on what basis or for what reason. If one has not effectively manipulated the independent variable, then a theory cannot be tested. Of course, not all experiments are designed to test theories. For example, if an experiment is designed to evaluate whether a specific advertisement changes a specific attitude, exposure to the advertisement is, by definition, the independent variable. In a laboratory setting, no manipulation check would be necessary, since the researcher would know the subjects were exposed to the advertisement. But in a survey experiment, one could not be assured of exposure, so a manipulation check would be essential.

In the social trust experiment described above, it took several rounds of pilot testing to arrive at a treatment that created a statistically
significant distinction in levels of social trust between the control condition and the high-trust treatment and the control condition and the low-trust treatment. These were small-scale, informal tests of the strength of potential treatments that were done in advance of the experimental study. For the purposes of operationalizing the manipulation check, I used a series of previously validated questions addressing social trust and combined them to form an index. Although pilot testing treatments may seem like an unnecessary delay in running an experiment, it is well worth doing before wasting time and money on an experiment with an ineffective treatment (see McLeod 2017; Perdue and Summers 1986; Thabane et al. 2010).

If one’s pilot test sample is drawn from the same population as the sample used for the experiment itself, then it may not be necessary to conduct a manipulation check during the experiment itself. However, more often than not, scholars use different (less expensive) samples for pilot testing, which makes it difficult to argue that the manipulation will work equally well on a substantively different group of respondents. Most often, manipulation checks should be included in the experiment as well, and they should occur after the dependent variable is assessed so as not to potentially contaminate the central experimental results. Note that the researcher’s aim is not perfection; it is not necessary that every single subject in the low-trust treatment fall to the bottom of the social trust scale, or that those assigned to the high-trust treatment must all end up at the top of the scale. Without knowing where they started on the scale, one can only make aggregate but not individual-level assessments of whether a given treatment “worked.”1 Even those with low levels of social trust may have had their social trust levels raised to a moderate level by the treatment.

1 In between-subjects designs, one can only know whether the treatment has been effective on average, not for specific individuals, since one does not know where each respondent started pretreatment. For dichotomous manipulations, this is similar to the logic behind two-sided noncompliance in encouragement designs.
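In practice, this kind of aggregate assessment usually amounts to comparing the mean of the manipulation check index across conditions. The sketch below illustrates the idea with simulated data; the index values, group sizes, and the choice of a simple two-sample t-test are illustrative assumptions, not details of the social trust study.

```python
import numpy as np
from scipy import stats

# Simulated post-treatment scores on a social-trust index (higher = more trusting).
rng = np.random.default_rng(7)
control = rng.normal(loc=4.0, scale=1.2, size=300)
high_trust = rng.normal(loc=4.6, scale=1.2, size=300)
low_trust = rng.normal(loc=3.5, scale=1.2, size=300)

# The check is aggregate: did each treatment shift the index, on average,
# relative to control, in the intended direction?
for label, group in [("high-trust vs. control", high_trust),
                     ("low-trust vs. control", low_trust)]:
    t, p = stats.ttest_ind(group, control)
    print(f"{label}: mean difference = {group.mean() - control.mean():+.2f}, "
          f"t = {t:.2f}, p = {p:.3f}")
```

A successful check here is a substantively large, correctly signed difference in each treatment arm, not any particular score for an individual respondent.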

There are some exceptions to the idea that all experiments need manipulation checks. For example, if the treatment is assumed to occur below a respondent’s level of conscious awareness, then a manipulation check is often not possible. Question wording experiments, in which one examines whether subtle differences in question wording make a difference to people’s responses, are examples of such. Likewise, it would be impossible to use manipulation checks in experiments that examine question order effects or in list experiments because there would be no expected differences between conditions. In all of these cases, the treatments are expected to occur below the respondent’s level of conscious awareness, thus limiting their ability to self-report.

One additional exception is when the treatment and the independent variable are indistinguishable, as occurs when the experiment is conducted in a setting in which exposure can be validated by an independent observer, such as in a laboratory. As in the advertisement example described above, if the independent variable is whether or not someone has been exposed to a specific program and the researcher can directly observe that this has or has not occurred, then there is little room for doubt. However, if the same independent variable/program exposure is administered in an online survey (or in the field in a design such as that described in Chapter 4 in this volume), there is reason for doubt. In this case, it makes sense to ask questions after the dependent variable has been measured in order to assess whether there is a difference in answers between respondents who have been exposed to a treatment or a control condition.2 However, if the same exact program is a treatment designed to instill anxiety in the participant, then anxiety is the independent variable, not exposure, and a manipulation check tapping anxiety is still essential, regardless of where the experiment is executed. There is no other way to validate that the treatment did, in fact, create greater anxiety relative to the control condition.

2 These have been dubbed “factual manipulation checks” by Kane and Barabas (2019), in contrast to “instructional manipulation checks” (i.e., trick survey questions designed to tap attention) and “subjective manipulation checks,” which supposedly tap latent variables. It is a misnomer to call these variables “latent”; that is, they are not being inferred from a statistical model. They are observed via self-report just as other variables are.

Although different types of manipulation check questions will be required based on the independent variables of interest, for most political science experiments, manipulation checks should be included. With so many experiments being conducted online, manipulation checks are particularly important. Despite their centrality to proper interpretation of experimental results, this component of experimental design is often neglected, particularly within political science.

How serious a problem is the paucity of manipulation checks within political science? To answer this question, we searched for all articles published in political science journals carried by JSTOR. Publications were included so long as they included either of the words “experiment” or “random” in their abstracts. We then narrowed this sample down to those publications using one or more randomly assigned experiments to make original contributions to knowledge. We eliminated experiments in which a manipulation check would be inappropriate or impossible to do, including question wording or order experiments, list experiments, field experiments,3 or studies in which the treatment was supposed to occur below the subject’s conscious awareness. According to this quick and dirty method, around 60–65% of experiments published in political science journals that should and could have manipulation checks in fact lack them.

3 Although field experiments theoretically could include manipulation checks, they have been excluded here because, in practice, this seldom makes sense due to the nature of field experimental treatments. Rather than induce a change in some independent variable construct, most field experimental treatments are, in and of themselves, the independent variable. For example, if a field experiment examines the hypothesis that turnout can be increased by sending a reminder postcard the week before an election, then sending the postcard is both the treatment and the independent variable, so there is no need to include a separate manipulation check.

226

Diana C. Mutz

them. This is an extremely large proportion given their centrality to drawing correct causal inferences, and particularly given that these are all experiments published in top journals carried by JSTOR.4,5

3 Although field experiments theoretically could include manipulation checks, they have been excluded here because, in practice, this seldom makes sense due to the nature of field experimental treatments. Rather than induce a change in some independent variable construct, most field experimental treatments are, in and of themselves, the independent variable. For example, if a field experiment examines the hypothesis that turnout can be increased by sending a reminder postcard the week before an election, then sending the postcard is both the treatment and the independent variable, so there is no need to include a separate manipulation check.

4 From the earliest JSTOR entries through the date we initiated this search (March 8, 2019), 628 articles met the initial requirement. Upon closer reading, 150 of the studies were not actual experiments, so they were removed from the denominator. The remaining 478 studies were true experiments utilizing random assignment, but some were field experiments, list experiments, or those altering question order or wording, and thus were not amenable to manipulation checks. Of the remaining studies, 35% included manipulation checks in one or more experiments.

5 A similar analysis of JSTOR articles strictly between 2001 and 2015 suggested that only 18% of articles included manipulation checks (Kane and Barabas 2019). I find a higher percentage when considering a much earlier time period. This implies that current scholars may be less likely to employ manipulation checks than earlier experimentalists, even though there are more experiments in political science journals than there were in the past.

Why this obvious oversight on our discipline’s part? I can only speculate, but there are multiple possibilities. First, in reading through the studies described above, it was notable that the test of the treatment’s effect on the dependent variable was often deemed to be the reason for not needing a manipulation check. When a dependent variable was influenced as predicted, it was automatically inferred to be a result of a successfully implemented treatment. This assumption ignores the fact that a treatment could have manipulated something other than what was intended. I explore this possibility further in my discussion of confounding. Second, with perhaps the exception of the political psychology subfield, political science tends to emphasize field experiments in which the effects of complex, “package” treatments are of interest. In these scenarios, researchers may care less about what precisely they have manipulated, so manipulation checks seem less important than validating effects on the dependent variable.

12.3.1 Attention Checks Are Not Manipulation Checks

Another possibility that has recently emerged in the literature is the belief that
manipulation checks can be replaced by attention checks or “screeners,” which are designed to assess how much attention a subject is paying to a study. In comparing manipulation checks and attention checks, Berinsky and colleagues (2014) argue that the two are similar, but that attention checks are preferable because they are “more flexible and less prone to inducing bias.” By attention checks, they mean trick questions embedded in a survey that appear to ask the participant to do one thing, but upon close examination ask them to do another. Figure 12.1 is reproduced from this article as an example.

Figure 12.1 Sample “screener” or attention check question. Note: This example is from Berinsky et al. 2014. Kane and Barabas (2019) refer to this as an “instructional manipulation check.”

The question shown in Figure 12.1 is intended to be inserted either pre- or posttreatment, and respondents are considered to “pass” such screeners only if they do exactly as directed by the sentence in the middle of the question and select answers that are not their true answer to the substantive question. Attention checks essentially assess whether subjects have paid close attention to the questions being asked.

Unfortunately, these two terms – attention checks and manipulation checks – have been confused and conflated. They do not serve the same purpose. The former assess attention levels, while the latter are used to evaluate whether the independent variable differs by experimental condition. This distinction is important because the term “manipulation check” has been used to refer to many pretreatment and posttreatment measures that do not assess the extent of variance across conditions in the independent variable.6

6 For example, Kane and Barabas (2019, p. 236) refer to trick survey questions designed to tap attention as “instructional manipulation checks.” Other self-reported measures have been dubbed “subjective manipulation checks,” although the distinction in purpose between “factual” and “subjective” manipulation checks is unclear.

Attention questions are fine if researchers want to assess levels of attention to survey questions among their experimental participants, but they do not replace manipulation checks (i.e., unless the experimental treatment involves altering levels of attention). Attention checks cannot inform the researcher about whether the
independent variable varies by experimental condition, which is the central point of manipulation checks. If manipulation checks are asked before assessing the dependent variable, they “run the risk of priming respondents about the treatment they just experienced, in effect treating them for a second time” (Berinsky et al. 2014, p. 744) or possibly treating even control subjects. For this reason, standard practice is to ask manipulation check questions after the dependent variable is assessed so that these questions cannot possibly affect responses on the dependent variable. It is unclear to me why asking manipulation check questions after the dependent variable is problematic. However, Berinsky and colleagues (2014, p. 744) suggest that “asking the dependent variable before the manipulation check may change responses on that manipulation check – the very measure a researcher needs to identify who is paying attention.” I am not sure how or why the manipulation check might be altered by coming after the dependent variable. But I would argue that manipulation checks are more important than assessments of attention levels, particularly given that the two measures do not serve the same purposes. If a treatment passes a manipulation check, it is safe to assume that respondents collectively paid adequate attention to the treatment. Thus, there is no need for attention checks if one includes a

manipulation check. Attention checks would only be of use if it were impossible to do manipulation checks, such as in cases of treatments that occur below people’s level of conscious awareness. The reverse is not true, however. Just because a respondent gives the answer that the researcher wanted for an attention check question (or for multiple such questions), this does not mean that he/she has been effectively “treated.” High levels of attention do not demonstrate that a treatment has influenced the independent variable. Attention check questions are not capable of demonstrating that the desired distinctions among conditions have been achieved. For these reasons alone, they are not an acceptable replacement for manipulation checks. One argument against using manipulation checks is that a treatment may work in an “online” fashion, meaning that people are exposed to and affected by the treatment as intended, possibly updating their beliefs as a result, but then no longer register having received this treatment by the time they get to the manipulation check. The study of online versus memory-based processing suggests that online processing occurs regularly in the political realm (e.g., Lavine 2002; Lodge et al. 1989). For example, I may have the impression that my local congressman is corrupt, yet not remember precisely what information led me to hold that impression or what he was convicted of. However, given the very short duration of most political science

experiments (under 15 minutes), it seems implausible that subjects would have been given new information and immediately have forgotten it. The contrast between online and memory-based processing involves the distinction between information that is or is not held in long-term memory, not information that is not processed to begin with (Kim and Garrett 2012).

If one is concerned that a treatment will be short-lived and thus forgotten by the time of the manipulation check, another solution is to randomly assign subjects to receive manipulation check questions either before or after the dependent variable is measured. This design allows for two useful assessments. First, one can observe whether or not the strength of the manipulation check fades quickly, from before the dependent variable is assessed to after it is assessed. Second, with this design, one can evaluate whether a manipulation check that is included before the dependent variable essentially increases the size of the treatment effects on the dependent variable. If there is no difference in either direction, the sample can be collapsed and analyzed as a whole.

The tendency to overlook manipulation checks in political science experiments is particularly problematic for studies that seek to interpret null experimental findings (see Franco et al. 2015, and Chapter 19 in this volume). Many factors individually and collectively favor null findings. For example, measurement errors, small sample sizes, and underpowered studies all favor null findings. As a result, substantive claims about null findings must be interpreted with caution. However, if one can demonstrate that a treatment successfully induced the desired change in the independent variable and that the dependent variable was nonetheless unaffected, then at least a stronger case can be made that these are meaningful null results.

12.3.2 The Problematic Practice of Selectively Dropping Experimental Subjects

In an era of readily available and inexpensive online experimental subjects, researchers have increasingly wanted to screen individual
experimental subjects out of their sample by criteria such as their post-treatment answers to questions serving as manipulation checks or attention checks (see Montgomery et al. 2018). Their goal is to narrow their sample down to subjects who are more attentive and/or more compliant with treatment. Although this practice is well intentioned, it poses serious threats to internal validity. Once we abandon random assignment to conditions and introduce post-treatment variables as determinants of who shows up in what condition, the probability models normally employed in experimental studies are off the table (see, e.g., Aronow et al. 2015). Between-subjects experiments facilitate aggregate comparisons of randomly assigned groups. When we violate principles of random assignment by eliminating people from their randomly assigned groups based on individual characteristics measured after treatment, we can no longer use the statistical approaches typically used to draw inferences about group differences. Dropping cases means that one can no longer treat the randomly assigned groups as equal in expectation on all other variables, so spuriousness becomes an issue for the interpretation of results. Further, the treatment itself could cause some subjects to be more likely than others to be dropped. Some scholars have suggested that it is most advantageous to have a single question with a dichotomous right/wrong answer as a manipulation check because then the analysis of results can be done on the entire experimental sample, and separately among those who pass the check (e.g., Berinsky et al. 2014; Kane and Barabas 2019). To do so takes the study outside of the experimental paradigm. In a valid experimental analysis, one cannot selectively remove subjects from conditions based on their answers to posttreatment questions. Parallels can be made to evaluating the treatment effects of the treated (average treatment effects on the treated (ATT) and effects of treatment on the treated (TOT)) by analyzing subjects “as treated” rather than “as assigned” (Angrist et al. 1996). But to do so well involves more than simply rerunning the experimental

analysis among the subgroup one views as having been treated. If the basis for selecting subjects occurs pretreatment and is determined in advance (e.g., eliminating subjects based on those who take a prespecified extraordinarily short time to complete the experiment), then the subsample can be safely analyzed without risk to internal validity (Aronow et al. 2015). Unfortunately, the cautionary notes emphasized by researchers who advocate analyzing results by who failed/passed manipulation checks or attention checks have been focused on external validity rather than internal validity. Their concerns have centered on skewing a sample demographically since the well-educated, for example, are more likely to read carefully than others and thus are more likely to pass. In reality, internal validity is a far more serious concern, since this practice undermines the whole point of doing an experiment. The practice of excluding respondents from experimental analyses has reached a fever pitch. In one recent study, fully 47% of the original sample was excluded for failing one or another type of check (Earp et al. 2019). The authors’ reason for doing so was that they wanted to replicate a laboratory study. However, given that similar checks were not administered in the laboratory study, one wonders in what sense this makes the samples comparable. Regardless, there is little basis for expecting the conditions to represent randomly assigned groups at this point. To clarify, my argument is not that attention checks should never be used. But attention checks are not the same thing as manipulation checks, so they should not supplant manipulation checks. Successful manipulation of the independent variable is extremely important to a well-executed experiment. Manipulation checks should be required in all but very few kinds of experiments to validate that the independent variable was, in fact, altered by treatments. My second point is that answers to manipulation check questions should not be used as the basis for analyzing results or subdividing experimental samples. If they are going to be the basis for subdividing experimental subjects, attention checks should only occur

pretreatment as measures of attention, not as substitutes for manipulation checks.

12.3.3 Confounded Treatments

Aside from strengthening internal validity, another use for manipulation checks is in assessing whether one treatment may have confounded another. Two types of confounding can produce inappropriate interpretations of experimental results. First, in a factorial design with multiple treatments, even when all treatments are assigned in a manner so that they are orthogonal to one another, it is possible that a treatment for one factor unintentionally affects another factor. For example, describing a country as a democracy makes experimental subjects more likely to believe that it is wealthy as well (Mutz 2020). If one is determined to know what kind of effects the form of government has independent of the country’s economic well-being, then this confounding is problematic. One obvious solution is to make wealth another experimental factor so that one can control perceptions of wealth while manipulating form of government independently. But even so, those who are told that the target country is a democracy but not wealthy may still perceive the country to be wealthier than one that is described as not a democracy and not wealthy. In many contexts this is unavoidable because people naturally have associations about what goes with what. The advantage of including manipulation checks for all independent variables is that one can know to what extent this has occurred. If these associations cannot be disentangled, there is no easy way to separate the direct effects of democracy from its indirect effects that flow through increased perceptions of wealth (Dafoe et al. 2018). However, manipulation checks can allow one to assess whether and to what extent one treatment is influencing another.

Even highly objective manipulations can be interpreted differently. Sometimes these stubborn associations are the whole point of the experiment. For example, if we manipulate the gender of a political candidate described identically in both cases, but the female is perceived to be more liberal

even when she has the same exact issue positions as the male, this is an important finding. It means that a female candidate would need to have more conservative positions in order to be perceived as ideologically equivalent to the male. If one suspects potential confounding based on a manipulated variable, then one can include manipulation checks even for variables that the design does not purposely manipulate. While including manipulation checks should be standard in experimental designs, it requires greater forethought to include manipulation checks for things that are not part of the design that might, nonetheless, be unintentionally manipulated. In one recent study that involved manipulating five different characteristics of potential trading partner countries in a factorial design to describe them as more or less like the USA, I included five different manipulation checks for characteristics such as the country’s perceived military strength, the general health and well-being of its citizens, a democratic form of government, cultural values and language similarity, and the strength of its economy (Mutz 2020). The manipulation checks revealed that several of the country characteristics were confounded; the more similar to the USA a country was described on one dimension, the more similar it was assumed to be on other dimensions, even though the treatments were all orthogonal. Thus, a culturally similar country was assumed to have a higher standard of living, and a country with a higher standard of living and one that was described as a democracy both resulted in greater perceptions of cultural similarity, even though these characteristics were each manipulated independent of one another. Although this study involved only five manipulation checks, I suspect that additional characteristics that I might have tapped that were clearly like or unlike the USA would have produced similar results. When manipulation checks corroborate lack of confounding on other variables, they can be quite useful for providing a clear understanding of cause and effect. But if they suggest that confounding has occurred, as in the example described above, they

provide no easy post hoc cure (Dafoe et al. 2018). Nonetheless, they allow additional insight into the experimental process and the reasons why particular results may have been obtained. For example, if previous studies manipulating whether a country was democratic or not demonstrated a greater desire to trade with democracies, this result could be misinterpreted. A manipulation check examining whether the democracy treatment also affected perceived affluence would help to determine whether the effect occurred only because people inferred that a democracy would be more affluent.
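To make this diagnostic concrete, the following minimal sketch (in Python) shows one way to check both whether each randomized factor moved its intended construct and whether it also moved constructs it was not supposed to touch. The variable names (democracy, wealthy, perceived_democracy, perceived_wealth) and the data file are hypothetical placeholders, not the measures used in Mutz (2020).

```python
# Sketch: using manipulation checks to detect cross-treatment confounding.
# Assumes a data set with randomly assigned factor indicators (democracy, wealthy)
# and post-outcome manipulation check ratings for each construct.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trade_experiment.csv")  # hypothetical export from the survey platform

for check in ["perceived_democracy", "perceived_wealth"]:
    # Regress each manipulation check on all randomized factors simultaneously.
    # A sizable coefficient on the factor's own check indicates a successful
    # manipulation; sizable coefficients on the other factor indicate confounding.
    fit = smf.ols(f"{check} ~ democracy + wealthy", data=df).fit()
    print(check)
    print(fit.params)
    print(fit.pvalues)
```

If the wealth coefficient in the perceived_democracy equation (or vice versa) is reliably different from zero even though the two factors were assigned orthogonally, the treatments are confounded in respondents’ perceptions in the way described above.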

12.4 Improving Experimental Treatments

One weakness in how political scientists often approach experimental treatments is that we spend relatively little time and effort producing them. With notable exceptions, treatments in experimental political science tend to be short, fleeting, and weak. Psychologists have called this the “MTurkification” of experimental research (Anderson et al. 2019). As the frequency of studies conducted online has increased, easier-to-execute studies that can be administered to online crowdsourced samples have proliferated. There is nothing inherently wrong with this practice, and it has some beneficial effects, to be sure. Scholars can do more studies in a shorter period of time for less money. However, this approach also encourages weaker treatments. Some psychologists argue that crowdsourced studies have displaced more involved experiments in settings that would require something other than remote, Internet-based interactions. Because requiring more of people’s time typically means spending more of the researcher’s money, one might think that, given online experiments’ low costs relative to other forms of data collection, crowdsourced studies would have the luxury of longer, more involved protocols. In actual practice, they tend to be quite short, with Mechanical Turk (MTurk) studies being the shortest of all (Anderson et al. 2019).

Accomplishing a strong experimental treatment in a short time window is understandably challenging, but it may well deserve more effort than it currently receives in political science. The approach most often used to implement treatment – a written statement or description that the subject reads – makes implementing strong and involving treatments exceedingly difficult. This common mode of experimental treatment produces relatively low levels of experimental realism; that is, few subjects will experience this kind of treatment in the same way and with the same impact as they would in real life. Instead of putting time and effort into making sure that a treatment is strong enough to produce significant differences in the independent variable, treatments are often taken for granted, and the focus is strictly on the dependent variable or the sample. In order to produce statistically significant findings with a treatment, online studies often attempt to compensate with larger sample sizes than were common in earlier experimental eras. Scholars concerned about p-hacking have targeted precisely this kind of finding for criticism (i.e., a finding that is statistically significant by traditional standards but that represents a relatively small effect size; Simonsohn et al. 2014). The text we ask people to read is not usually engaging or attention-grabbing. Moreover, when subjects’ motives on crowdsourcing platforms tend to be to get through as many tasks for payment as possible in a short amount of time, this undermines our efforts to involve them in any meaningful way. In addition, repeated use has meant that these subjects have increasingly come to resemble a subject pool in that they are no longer naive experimental subjects (Chandler et al. 2013). Studies do not necessarily need to be labor-intensive in order to produce important findings, but political scientists may be overlooking opportunities for more involving treatments. Informational treatments with simple text may work well if they convey that information clearly, but treatments need not necessarily take this form. Other forms of sensory input can better simulate real-world experience. By using photographs and video

found online, experimental treatments can be made far more interesting and involving (e.g., Mutz 2015). Virtually everyone with a cell phone and a laptop can edit video these days (particularly those under 30). Moreover, video and photo editing software put more involving experimental treatments within everyone’s reach. In one excellent example, Hopkins (2015) used video-based experiments to examine whether the presence of more culturally distinctive immigrants (i.e., Latino immigrants with darker skin tones or who speak Spanish) increases opposition to immigration. Using a CBS News video excerpt, he altered both the immigrant’s skin tone and whether they spoke fluent English, broken English, or fluent Spanish. The advantage of treatments that go beyond mere text is not limited to producing a more attentive pool of subjects. Greater sensory input and less abstract treatments also should be more likely to induce the desired variation in the independent variable. This is probably especially important when a treatment involves reactions that are not cognitive or informational in nature.

One kind of experimental treatment that is seldom utilized today in experimental political science is an old-fashioned one: face-to-face studies. Consider the contemporary relevance of Nisbett and Cohen’s (1996) studies of the difference between how male northerners and southerners in the USA respond to minor affronts, such as being bumped into by a stranger in a hallway. Whether in a laboratory, a lab-in-the-field, or some other location, treatments of this kind are seldom done by political scientists today, perhaps because they are more difficult to execute, or perhaps because scholars fear that their institutional review boards might object. Nonetheless, in an era of burgeoning political violence, it would seem important to better understand what people consider an “appropriate” reaction to provocations of various kinds, whether online or in face-to-face settings. The experimental realism of being unnecessarily bumped in a hallway in a face-to-face setting (Nisbett and Cohen 1996) or simply observing someone else being viciously attacked online

is difficult to match. This is what makes these types of interventions powerful sources of treatment. Perhaps too often, experimental subjects in political science are asked to “Consider a hypothetical scenario in which …” or “Imagine that …” Hypotheticals are a common form of experimental treatment within political science. This is a particularly common approach in international relations experiments where complex hypothetical scenarios are often described to respondents in order to obtain their reactions (e.g., Chaudoin 2014; Kertzer and Brutger 2016). In presenting people with such scenarios, researchers should take into account that asking people to think hypothetically is, in itself, an experimental manipulation that may not be present when such a decision is made in real life. When asking subjects to consider conditions that are explicitly hypothetical, we change the way that information is processed. This change is parallel to how people process an object, event, or person that is perceived to be psychologically distant – that is “not present in the direct experience of reality” (Liberman et al. 2007, p. 353). To the extent that we want people to process experimental stimuli as if they were real, this is obviously problematic. As documented in a recent meta-analysis of this literature (Soderberg et al. 2015), when something is described as hypothetical, people’s mental representations of that event or person become more abstract. Their attention is drawn to essential characteristics of the target, with less attention being paid to relevant details. This general finding, known as construal level theory, has been validated in hundreds of articles across a variety of fields (see, e.g., Soderberg et al. 2015; Trope and Liberman 2010; Wakslak et al. 2006). In general, construal level theory supports the idea that the same target can be mentally represented (i.e., construed) at different levels of abstraction. Construal level theory demonstrates that the more removed something is from the self, from the here and now, and from reality more generally, the less concrete specifics will matter and the more abstract principles will be used in

decision-making. As Trope and Liberman (2010) explain, to see the forest, we need to back away from the trees. The consequence of describing something in hypothetical/distant terms is to focus people’s attention on ends rather than means, and on abstractions rather than specific details. These considerations have obvious downstream consequences for political judgments. For example, hypotheticals should encourage more ideologically consistent views rather than decisions based on immediate context (see Ledgerwood et al. 2010a, 2010b, 2012). Although we do not have many direct comparisons, at least one survey experiment conducted using a hypothetical obtained different results from a field experiment involving reports on non-hypothetical decisions (see also LaPiere 1934). Findley and colleagues (2017) conducted parallel experiments requesting anonymous business incorporation from just under 4000 service providers in over 180 countries. They then conducted a survey experiment asking the same population in hypothetical terms about anonymous incorporation. The results suggested far more willingness to do so in the field experiment than in the survey experiment. The hypothetical question was more likely to elicit compliance with abstract norms about corporate transparency, although concerns about social desirability could also explain this difference. At the very least, construal level theory suggests that scholars should use caution when extrapolating results from hypothetical scenarios. If we would like our findings to reflect how people make decisions here and now in their world of direct experience, hypotheticals should be avoided because they put people in a more abstract mindset than is common in day-to-day decision-making and thus may not generalize well to decisionmaking that is not hypothetical.

12.5 Administering Experimental Treatments: The Within-Subjects Option

As a practical matter, one reason for the MTurkification of experimental political
science is that crowdsourced samples provide a large number of participants, in a hurry, at a relatively low cost. When political scientists with little time and even less money consider their options, the attractions are obvious. But we have reached a point in the evolution of experimental political science where the limitations of this approach are apparent in the form of less involving treatments, non-naive participants, and limited attention. Without MTurkification, it seems doubtful that attention checks and their attendant problems would have emerged. Since identifying shirkers turns out to be of little benefit to researchers (see Berinsky et al. 2016), other options must be considered.

At least one alternative to the most common design and treatment format has not yet been fully exploited. To date, experimental treatments in political science are administered almost exclusively as between-subjects treatments; this means that each person (or other unit of analysis) receives only one treatment and the dependent variable is measured only once. As a result, these experiments require relatively large samples and somewhat large effect sizes. For between-group differences to stand out relative to large within-group differences, a study requires many subjects, thus adding to the experimenter’s time and expense. In a within-subject or repeated measures design, treatments are administered more than once to each unit of analysis/person and the dependent variable is measured after each treatment.

There are two main advantages to within-subject designs. First, they require much smaller sample sizes. In addition, because each person (or other unit of analysis) serves as his or her own control, error (most of which comes in the form of within-group variation in between-subject designs) is greatly reduced. This means the experimental design has greater power to identify even subtle effects. For some purposes, a between-subject design is clearly preferable; for example, if the effects of a treatment are long-lasting or not easily reversible – such as learning new information – this approach makes
little sense. But for many research questions, between-subjects is a far less powerful design than administering multiple treatments to individual participants. While not all research questions in political science are amenable to within-subject treatments, many are, at least far more than we currently test in this fashion.

The main reason that social scientists avoid within-subject designs is that they are concerned about order effects. For example, in a three-treatment design, if each participant is exposed to all three treatments, one of them must occur first, another second, and another third. The potential existence of order effects (also known as spillover effects) is not, in itself, a good reason to avoid within-subject designs because there are ways in which they can be dealt with in the context of the experimental design – far more approaches than I can cover here.7 I will nonetheless offer a few basic examples to highlight the logic and simplicity of doing so.

Fortunately, the same approach that renders between-subject experimental conditions equal in expectation also works for order effects: the power of randomization. It is not necessary to assume that the order of treatments does not matter. Instead, by randomizing the order in which treatments are presented, order effects effectively cancel one another out in the aggregate, or their effects are accounted for in the model as an experimental factor. Researchers can take into account order effects in a variety of ways. Doing so can make the experimental design somewhat more complex, but for the most part all that is necessary is the ability to randomly assign yet another factor or two. This is not difficult with the kind of software commonly used today for experimental designs. Particularly with a design that does not include a large number of different treatments, this is easy to accomplish through either a fully counterbalanced or partially counterbalanced design such as the Latin square design.

7 For a more detailed treatment of this topic, see Alferes (2013).
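Before turning to those examples, a brief simulation can illustrate the power advantage claimed above for within-subject designs: when stable individual differences are large relative to the treatment effect, letting each subject serve as his or her own control shrinks the standard error dramatically. The sketch below uses made-up parameter values and ignores order effects; it illustrates the general logic rather than any study discussed in this chapter.

```python
# Sketch: the power advantage of within-subject designs, with made-up parameters.
# Subjects differ a lot at baseline (sd = 1.0); the true treatment effect is small
# (0.2); occasion-specific noise is modest (sd = 0.3).
import numpy as np

rng = np.random.default_rng(0)
n, effect = 200, 0.2
baseline = rng.normal(0.0, 1.0, n)       # stable individual differences
half = n // 2

# Between-subjects: half the sample is assigned to control, half to treatment.
control = baseline[:half] + rng.normal(0.0, 0.3, half)
treated = baseline[half:] + effect + rng.normal(0.0, 0.3, half)
se_between = np.sqrt(control.var(ddof=1) / half + treated.var(ddof=1) / half)

# Within-subject: every subject responds under both conditions, so baseline
# differences cancel when responses are differenced within person.
diffs = (baseline + effect + rng.normal(0.0, 0.3, n)) - (baseline + rng.normal(0.0, 0.3, n))
se_within = diffs.std(ddof=1) / np.sqrt(n)

print(f"SE of between-subjects contrast: {se_between:.3f}")
print(f"SE of within-subject contrast:   {se_within:.3f}")
```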

As an example of a fully counterbalanced design, consider a study focused on understanding how online forms of social approval affect the credibility of factual claims that have political significance (Zheng 2018). The research question under study was whether online social approval/disapproval would enhance or detract from the credibility of the factual information presented. Three treatment conditions were proposed: one in which the statement received overwhelmingly positive comments from others, a second in which it received negative comments, and a third that received no comments of either kind. Obviously, one would not want to ask subjects the exact same question about exactly the same fact three different times. Doing so would seem strange and pointless. But by using multiple target facts that are rotated through the three treatments in different combinations, it is possible to compare the same basic outcome (in this case, the respondent’s belief in the stated fact) under differing conditions of social approval. In addition to the additional power provided by the within-subject design, researchers can also observe any effects of the order of treatments or order of interaction with treatments, since order is its own experimental factor. Using multiple treatments in rotation, this design provides theoretically relevant variation in the independent variable. Moreover, it does so in a realistic context, since people online are seldom exposed to one and only one message at a time in a given online experience. Exposure to many such claims in succession is highly realistic. By using factual claims involving three different issues (a statement about the extent to which gun deaths have increased, a statement about at what fetal age most abortions in the USA take place, and a claim that a recreational marijuana user died from eating a legal marijuana-laced cookie), the experience of reading three online posts on various topics along with accompanying comments seemed quite natural, so people were not sensitized to what was different from one trial to the next.
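A minimal sketch of how such a rotation might be assigned is shown below. The condition and issue labels are hypothetical stand-ins for the design just described, and the assignment function is illustrative rather than a reconstruction of the original study’s code; the key point is that both the ordering of conditions and the pairing of facts with conditions are randomized, and the assigned sequence is stored so that order can enter the analysis as an experimental factor.

```python
# Sketch: fully counterbalancing a three-condition within-subject design in which
# three factual claims are rotated through the social approval conditions.
import itertools
import random

CONDITIONS = ["positive_comments", "negative_comments", "no_comments"]
FACTS = ["gun_deaths", "abortion_timing", "marijuana_cookie"]  # hypothetical labels
SEQUENCES = list(itertools.permutations(CONDITIONS))           # 3! = 6 orderings

def assign(respondent_id):
    sequence = random.choice(SEQUENCES)   # which condition is seen first, second, third
    facts = random.sample(FACTS, k=3)     # which claim is paired with each position
    return {"id": respondent_id,
            "sequence": sequence,         # retained as a factor in the analysis
            "trials": list(zip(sequence, facts))}

print(assign("R_001"))
```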

With three conditions, there are 3! = 6 possible orderings of conditions.8 If we designate the three conditions (positive, negative, and control for social support) as A, B, and C, then we have ABC, CBA, CAB, BAC, ACB, and BCA.

8 The number of treatment levels or total conditions dictates the number of possible orderings, so that a two-condition design has two possible orderings, a three-condition design has 3! = 6 possible orderings, and a four-condition design has 4! = 24 possible orderings.

In a fully counterbalanced design, subjects are randomly assigned to one of these six possible sequences, and order becomes a factor in the analysis of the experimental design, so its effects can also be observed. Which statement appears in conditions A, B, and C is also randomly assigned. A series of “Like” or “Dislike” comments was shown under each Facebook post, as is often the case in the real world. After each exposure to a Facebook post, respondents were asked whether they thought a given factual statement was true. In the analysis of this study’s results, each individual’s general tendency to believe or not to believe what they read online drops out of the model, since people’s level of belief in one condition was being compared to their level of belief in a statement shown under different social approval conditions (Zheng 2018). Analysis techniques must take into account the fact that the same respondent contributes three dependent variables and that these observations are not independent (see Montgomery 2005, ch. 4 for an example). But most statistical packages are set up so that one can designate within-subject versus between-subject factors in an experimental analysis.

In this example, as in many other situations involving political assessments, within-subject designs are appropriate as well as powerful. Just as viewing Facebook posts is something that people do over and over again, most political judgments and decisions are not one-time-only affairs. Indeed, repeated judgments over time are part and parcel of political behavior in the polling booth, across elections, and in other contexts. Likewise, when there are large individual differences, as with psychophysiological
measures, within-subject designs are ideal, since they remove individual differences from the error term in the model and allow subjects to be compared to themselves under varying conditions. One potential problem is that with more treatments (or levels of a single treatment), the number of possible orderings can quickly get out of control. For this reason, partially counterbalanced designs are sometimes employed. For example, in a study of the consequences of incivility in political discourse, each participant viewed four different issue debates, each about a totally different issue (Mutz 2007). The videos held the political content constant, but varied whether the exchange involved incivility or not and whether the camera featured the exchange of views using closeups of the participants’ faces or shots from a more distant camera angle. This design exposed each respondent to four possible conditions and was administered as a withinsubject experiment by having each subject view four five-minute-long exchanges, one representing each of these four conditions. After each segment, subjects reported their issue attitudes and assessed the strengths of the arguments behind each candidate’s viewpoint. Because each exchange involved one of four different issues, respondents were not exposed to any one discussion more than once, yet each subject was in each of the four experimental conditions. Instead of 24 possible orderings of these four possible treatments, a partially counterbalanced Latin square design such as this must meet three key requirements. These requirements are illustrated in Table 12.1, where the notation A, B, C, and D refers to the four treatments and the four columns refer to the ordering of treatments in a sequence. First, each treatment must appear in each of the four possible positions, 1–4. Second, each sequence (row) must include each of the four possible treatments. Third, each treatment must precede every other treatment as often as it follows it. It need not immediately precede or follow each other condition, but it must come before the other condition twice and follow it in sequence

Table 12.1 Partial counterbalancing in within-subject experimental designs.

                 Order within sequence
                 First    Second    Third    Fourth
Sequence 1       A        B         C        D
Sequence 2       B        D         A        C
Sequence 3       C        A         D        B
Sequence 4       D        C         B        A

Note: Treatments are represented by the letters A, B, C, and D. Columns represent in what order respondents receive a given experimental treatment. Rows represent the four possible sequences to which a respondent may be assigned in a partially counterbalanced design. Each column and row includes each treatment once. In addition, each treatment follows every other treatment as often as it precedes it.

twice.9 Thus, by using only four different sequences – a far cry from 24 – one can achieve partial counterbalancing. Moreover, the sequence to which each respondent was assigned becomes an experimental factor in the analysis, allowing one to account for the variance produced by order effects. The only downside for the researcher is that 16 different versions of the treatment must be created, one for each treatment (four) by each topic (four).

9 Note that to partially counterbalance requires an even number of experimental conditions.

Within-subject designs have already been used in some political science experiments (e.g., Kam and Simas 2010; Mutz 2007), but their potential has gone largely untapped, perhaps due to a lack of familiarity with how to deal with order effects. Order effects are typically not insurmountable, and resources are available to identify the ideal randomization strategy for a given design (e.g., Alferes 2013). Ironically, scholars in other disciplines have made use of this approach to study some of the most vexing issues within political science. For example, how does the repetition of false claims online facilitate belief in them, even when people know better (see, e.g., Fazio et al. 2015)? The power of within-subject designs to identify
even subtle effects without requiring huge sample sizes makes them well worth their additional complexity.
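Generating assignments under this kind of partial counterbalancing is straightforward to script. The sketch below hard-codes the four sequences from Table 12.1, randomly assigns each respondent to one of them, and pairs each treatment with a different issue so that no respondent encounters the same topic twice; the treatment and issue labels are hypothetical placeholders, and the assigned sequence is kept so it can be entered as a factor in the analysis.

```python
# Sketch: assigning respondents under the partially counterbalanced (Latin square)
# design of Table 12.1. Treatment and issue labels are hypothetical placeholders.
import random

ISSUES = ["issue_1", "issue_2", "issue_3", "issue_4"]

# The four sequences from Table 12.1: each treatment appears once per row and once
# per position, and follows every other treatment as often as it precedes it.
SEQUENCES = [
    ["A", "B", "C", "D"],
    ["B", "D", "A", "C"],
    ["C", "A", "D", "B"],
    ["D", "C", "B", "A"],
]

def assign(respondent_id):
    seq_index = random.randrange(len(SEQUENCES))   # which sequence (itself a factor)
    sequence = SEQUENCES[seq_index]
    issues = random.sample(ISSUES, k=4)            # a different issue in each position
    return {"id": respondent_id,
            "sequence": seq_index + 1,
            "trials": list(zip(sequence, issues))}

print(assign("R_001"))
```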

12.6 Conclusion

Experimental treatments are essentially means to an end, and as such, it is easy to overlook their importance in the experimental enterprise. Maximizing the potential for experiments to contribute to knowledge within political science will require ingenuity in the years ahead. In considering what forms of innovation to pursue, there are several important considerations. First and foremost, experimental treatments need to be capable of producing clear distinctions in levels of the independent variable, and demonstrably so through the use of manipulation checks that operationalize the studies’ independent variables. Second, the extent to which treatments are perceived as real and taken seriously by subjects (i.e., experimental realism) is a problem of as yet unknown proportions that is nonetheless worthy of concern because it is central to accomplishing most research goals. Third, it is critical that the decision-making process evoked by the treatment match the level of abstract/concrete thought that occurs when such a decision or judgment occurs in the real world. Finally, today’s political scientists almost exclusively use post-treatment between-subject designs. There are advantages to such designs, to be sure, but other research designs and ways of administering treatments should be considered, particularly given the nature of the research questions commonly asked.

References

Alferes, Valentim R. 2013. “Within-Subjects Designs Randomization.” In Methods of Randomization in Experimental Design. Thousand Oaks, CA: SAGE Publications, pp. 65–106.
Anderson, Craig A., Johnie J. Allen, Courtney Plante, Adele Quigley-McBride, Alison Lovett, and Jeffrey N. Rokkum. 2019. “The MTurkification of Social and Personality Psychology.” Personality and Social Psychology Bulletin 45(6): 842–850.
Angrist, Joshua D., Guido W. Imbens, and Donald B. Rubin. 1996. “Identification of Causal Effects Using Instrumental Variables.” Journal of the American Statistical Association 91(434): 444–455.
Aronow, Peter Michael, Jonathon Baron, and Lauren Pinson. 2015. “A Note on Dropping Experimental Subjects Who Fail a Manipulation Check.” URL: https://ssrn.com/abstract=2683588 or http://dx.doi.org/10.2139/ssrn.2683588
Berinsky, Adam J., Michele F. Margolis, and Michael W. Sances. 2014. “Separating the Shirkers from the Workers? Making Sure Respondents Pay Attention on Self-administered Surveys.” American Journal of Political Science 58: 739–753.
Berinsky, Adam J., Michele F. Margolis, and Michael W. Sances. 2016. “Can We Turn Shirkers into Workers?” Journal of Experimental Social Psychology 66: 20–28.
Bhattacherjee, Anol. 2012. “Social Science Research: Principles, Methods, and Practices.” URL: http://scholarcommons.usf.edu/oa_textbooks/3/
Chandler, Jesse, Pam Mueller, and Gabriele Paolacci. 2014. “Nonnaïveté among Amazon Mechanical Turk Workers: Consequences and Solutions for Behavioral Researchers.” Behavior Research Methods 46: 112–130.
Chaudoin, Stephen. 2014. “Promises or Policies? An Experimental Analysis of International Agreements and Audience Reactions.” International Organization 68: 235–256.
Chiang, I-Chant A., Rajiv S. Jhangiani, and Paul C. Price. 2015. Research Methods in Psychology. Online adaptation of Research Methods of Psychology, by Paul C. Price. URL: https://saylordotorg.github.io/text_research-methods-in-psychology/
Christov-Moore, Leonardo, Taisei Sugiyama, Kristina Grigaityte, and Marco Iacoboni. 2017. “Increasing Generosity by Disrupting Prefrontal Cortex.” Social Neuroscience 12(2): 174–181.
Coleman, Douglas E., and D. C. Montgomery. 1993. “A Systematic Approach to Planning for a Designed Industrial Experiment.” Technometrics 35: 1–27.
Dafoe, Alan, Baobao Zhang, and Devin Caughey. 2018. “Information Equivalence in Survey Experiments.” Political Analysis 26(4): 399–416.
Earp, Brian D., Joshua T. Monrad, Marianne LaFrance, John A. Bargh, Lindsey L. Cohen, and Jennifer A. Richeson. 2019. “Gender Bias in Pediatric Pain Assessment.” Journal of Pediatric Psychology 44(4): 403–414.
Fazio, Lisa K., Nadia M. Brashier, B. Keith Payne, and Elizabeth J. Marsh. 2015. “Knowledge Does Not Protect against Illusory Truth.” Journal of Experimental Psychology: General 144(5): 993–1002.
Findley, Michael G., Brock Laney, Daniel L. Nielson, and Jason C. Sharman. 2017. “External Validity in Parallel Global Field and Survey Experiments on Anonymous Incorporation.” Journal of Politics 79(3): 856–872.
Franco, Annie, Neil Malhotra, and Gabor Simonovits. 2015. “Underreporting in Political Science Survey Experiments: Comparing Questionnaires to Published Results.” Political Analysis 23(2): 306–312.
Hopkins, Daniel J. 2015. “The Upside of Accents: Language, Inter-Group Difference, and Attitudes toward Immigration.” Journal of Politics 45(3): 531–557.
Kam, Cindy D., and Elizabeth N. Simas. 2010. “Risk Orientations and Policy Frames.” Journal of Politics 72(2): 381–396.
Kane, John V., and Jason Barabas. 2019. “No Harm in Checking: Using Factual Manipulation Checks to Assess Attentiveness in Experiments.” American Journal of Political Science 63(1): 234–249.
Kertzer, Joshua D., and Ryan Brutger. 2016. “Decomposing Audience Costs: Bringing the Audience Back into Audience Cost Theory.” American Journal of Political Science 60: 234–249.
Kim, Eunji. 2019. “Entertaining Beliefs in Economic Mobility.” Dissertation submitted in Political Science and Communication, University of Pennsylvania.
Kim, Young Mie, and Kelly Garrett. 2012. “On-line and Memory-Based: Revisiting the Relationship between Candidate Evaluation Processing Models.” Political Behavior 34(2): 345–368.
Kreps, Sarah, and Stephen Roblin. 2019. “Treatment Format and External Validity in International Relations Experiments.” International Interactions. DOI: 10.1080/03050629.2019.1569002.
LaPiere, Richard T. 1934. “Attitudes vs. Actions.” Social Forces 13(2): 230–237.
Lavine, Howard. 2002. “On-line vs. Memory-based Process Models of Political Evaluation.” In Political Psychology, ed. K. R. Monroe. Mahwah, NJ: Erlbaum, pp. 225–247.
Ledgerwood, Alison, and Shannon P. Callahan. 2012. “The Social Side of Abstraction: Psychological Distance Enhances Conformity to Group Norms.” Psychological Science 23(8): 907–913.
Ledgerwood, Alison, Yaacov Trope, and Shelly Chaiken. 2010a. “Flexibility Now, Consistency Later: Psychological Distance and Construal Shape Evaluative Responding.” Journal of Personality and Social Psychology 99(1): 32–51.
Ledgerwood, Alison, Yaacov Trope, and Nira Liberman. 2010b. “Flexibility and Consistency in Evaluative Responding: The Function of Construal Level.” Advances in Experimental Social Psychology 43: 257–295.
Liberman, Nira, Yaacov Trope, and Elena Stephan. 2007. “Psychological Distance.” In Social Psychology: Handbook of Basic Principles, eds. A. W. Kruglanski and E. T. Higgins. New York: The Guilford Press, pp. 353–381.
Lodge, Milton, Kathleen M. McGraw, and Patrick Stroh. 1989. “An Impression-Driven Model of Candidate Evaluation.” American Political Science Review 83(2): 399–419.
McLeod, Saul A. 2017. Psychology Research Methods. Simply Psychology. URL: www.simplypsychology.org/research-methods.html
Montgomery, Douglas C. 2005. Design and Analysis of Experiments. Danvers, MA: John Wiley and Sons.
Montgomery, Jacob M., Brendan Nyhan, and Michelle Torres. 2018. “How Conditioning on Posttreatment Variables Can Ruin Your Experiment and What to Do about It.” American Journal of Political Science 62(3): 760–775.
Mutz, Diana C. 2005. “Social Trust and E-Commerce: Experimental Evidence for the Effects of Social Trust on Individual Economic Behavior.” Public Opinion Quarterly 69(3): 393–416.
Mutz, Diana C. 2007. “Effects of ‘In-Your-Face’ Television Discourse on Perceptions of a Legitimate Opposition.” American Political Science Review 101(4): 621–635.
Mutz, Diana C. 2015. In Your Face Politics: The Consequences of Uncivil Media. Princeton, NJ: Princeton University Press.
Mutz, Diana C. 2020. “Progress and Pitfalls Using Survey Experiments in Political Science.” In Oxford Research Encyclopedia of Politics. New York: Oxford University Press, pp. 1–22.
Nisbett, Richard E., and Dov Cohen. 1996. Culture of Honor. Boulder, CO: Westview Press.
Perdue, Barbara C., and John O. Summers. 1986. “Checking the Success of Manipulations in Marketing Experiments.” Journal of Marketing Research 23(4): 317–326.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference, 2nd Ed. New York: Houghton Mifflin.
Simonsohn, Uri, Leif D. Nelson, and Joseph P. Simmons. 2014. “P-curve: A Key to the File-Drawer.” Journal of Experimental Psychology: General 143(2): 534–547.
Soderberg, Courtney K., Shannon P. Callahan, Annie O. Kochersberger, Elinor Amit, and Alison Ledgerwood. 2015. “The Effects of Psychological Distance on Abstraction: Two Meta-Analyses.” Psychological Bulletin 141(3): 525–548.
Thabane, Lehana, Jinhui Ma, Rong Chu, Ji Cheng, Afisi Ismaila, Lorena P. Rios, Reid Robson, Marroon Thabane, Lora Giangregorio, and Charles H. Goldsmith. 2010. “A Tutorial on Pilot Studies: The What, Why and How.” BMC Medical Research Methodology 10: 1.
Trope, Yaacov, and Nira Liberman. 2010. “Construal-Level Theory of Psychological Distance.” Psychological Review 117(2): 440–463.
Wakslak, Cheryl J., Yaacov Trope, Nira Liberman, and Rotem Alony. 2006. “Seeing the Forest When Entry Is Unlikely: Probability and the Mental Representation of Events.” Journal of Experimental Psychology: General 135(4): 641–653.
Zheng, Alina. 2018. “Seeing Is Believing on Social Media: Predetermined by Opinion or Not?” Unpublished manuscript, University of Pennsylvania.

CHAPTER 13

Beyond Attitudes: Incorporating Measures of Behavior in Survey Experiments

Erik Peterson, Sean J. Westwood, and Shanto Iyengar

Abstract

As the use of survey experiments has spread throughout political science, experimental designs have grown increasingly complex. Yet, most survey experiments rest on a basic protocol by which treatments are delivered with textual vignettes, and the effects of these interventions are then measured using self-reports of political attitudes or behaviors. We outline several design innovations that allow researchers to move beyond self-reports by directly embedding politically relevant behaviors into survey experiments. As described in this chapter, these innovations enable experimentalists to strengthen the power of their treatments while enhancing the validity of their measures of treatment effects. We document these advances with illustrations drawn from a wide range of studies focusing on exposure to news reports, party polarization, racial prejudice, and physiological arousal.

As survey experiments have become standard practice throughout political science, they have grown increasingly complex. Researchers have moved from relatively simple manipulations based on “split ballots” or a small number of vignette treatments (Nosanchuk 1972; Sniderman and Grob 1996) toward high-dimensional designs incorporating multilayered or continuous treatment schemes that simultaneously manipulate many aspects of the experimental

context (Hainmueller et al. 2014; see also Chapter 2 in this volume). Yet, despite these advances in design, much survey experimental work remains wedded to a basic paradigm in which treatments are delivered with vignettes and treatment effects are observed using self-reports of political attitudes or behaviors. In this chapter, we outline several measurement advances that allow researchers to overcome the dependence on self-reports

by directly embedding politically relevant behaviors into survey experiments.1 As we argue throughout this chapter, these advances provide opportunities for researchers to both refine the delivery of experimental treatments and also obtain improved measurement of treatment effects. To distinguish these approaches from existing practice, we begin by reviewing survey experimental designs fashioned around vignette-based treatments and selfreported outcome measures. We then discuss how the incorporation of behavioral indicators can strengthen the impacts of experimental treatments and enhance the external validity of experimental results, and we demonstrate the utility of this approach using illustrations drawn from recent studies. In closing, we discuss the implications of the behavioral approach for the future of experimental research in political science.

1 We use “behavior” to broadly refer to measurements not captured by self-reports.

2 Survey studies in which the experimental treatments are delivered outside the survey setting (e.g., see Chapters 3–6 in this volume) are beyond our present focus, but the points we make regarding the benefits of behavioral measures also apply in this setting.

13.1 Common Survey Experimental Designs

In our working definition, survey experiments are studies in which respondents complete a survey instrument that includes an experimental treatment and a set of outcome measures.2 This general approach has a long history. For decades, public opinion surveys have incorporated split-ballot experiments that manipulate the wording or order of the questions respondents encounter (e.g., Schuman and Presser 1981). In assessing the influence that mechanical features of a survey exert on responses, early survey experiments primarily addressed measurement concerns, although the approach also yielded substantive insights at times (e.g., Mondak 1993). More recently, survey researchers have refocused experiments to
shed light on substantive questions (Mutz 2011; Sniderman 2011). The combination of a growing disciplinary interest in causal inference (Druckman et al. 2006) and the increased accessibility of a variety of subject pools – students (Druckman and Kam 2011), workers in online labor markets (Berinsky et al. 2012; Coppock and McClellan 2019; Chapter 9 in this volume), nationally representative samples from survey vendors (Mutz 2011), and political elites (Chapter 8 in this volume) – has contributed to the widespread application of this method to address questions throughout political science (e.g., Hyde 2015; McDermott 2002; Mullinix et al. 2015). While there is considerable variation in their design and implementation, most political science survey experiments have two features in common. First, the experimental treatment is delivered through a text-based vignette. This means, for instance, that participants in an international relations study might read about an escalating international crisis with a foreign country, with elements of this crisis (e.g., the opposing country’s form of government) randomly assigned by the researchers (Tomz and Weeks 2013). Those in a political communication experiment would have viewed a short news story with a randomly assigned issue frame embedded inside the article (Nelson et al. 1997), while the subjects of a comparative politics study might have encountered the biography of a potential welfare recipient, again with aspects of this welfare recipient’s profile randomly assigned (Aarøe and Petersen 2014). Second, following delivery of the treatment, its effect is assessed with an attitudinal outcome. We follow Eagly and Chaiken (1993, p. 1), who define an attitude as a “psychological tendency that is expressed by evaluating a particular entity with some degree of favor or disfavor.” For political scientists, favored attitudinal outcomes are selfreported evaluations of the political targets featured in these treatments. In the case of the international relations study, respondents might be asked whether they favor the use of force against a foreign rival (Tomz and Weeks 2013). Political communication participants


typically indicate whether they approve or disapprove of some message or hold a particular issue position (Nelson et al. 1997), and those in the comparative politics study might report their degree of opposition to government welfare programs (Aarøe and Petersen 2014). Survey experiments using this established protocol have advanced any number of areas of political science (see Chapter 2 in this volume). At the same time, the dominance of this approach has revealed a number of limitations. First, vignettes constrain how treatments can be delivered, limiting the target subject matter and the potency of treatments used in these studies. Second, using self-reported attitudes as the sole outcome makes it difficult to assess the political relevance of treatment effects because of concerns that participants either intentionally misreport their attitudes or are incapable of accurately reporting on their mental state.3 No matter the underlying reason, self-reports are often at odds with a respondent’s true attitudes and actual behaviors outside the survey setting. We elaborate on each concern below.

13.2 Limitations of Vignette Designs

Informational vignettes are long-standing features of survey experimental research in political science (e.g., Alexander and Becker 1978; Nosanchuk 1972). In this approach, respondents encounter treatments that provide a short description of a political scenario that includes a randomized component. This method of treatment delivery has much to recommend it. By providing relevant information about real-world issues and events, vignettes can engage otherwise inattentive survey respondents. The ancillary features of a vignette that are fixed across respondents provide the context needed for a study's ecological validity and

3 In doing so, we set aside some other important issues, such as concerns that measuring outcomes immediately after a treatment is delivered may exaggerate its effects.


help to hold constant subjects' beliefs about background features of the scenario that are not directly related to the treatment effect of interest (Dafoe et al. 2018). When analyzing results from simple vignette treatments, the differences between various treatment arms are readily interpretable, such as the differences created by varying the party label attached to a policy proposal (Bullock 2011) or changing the gender of a hypothetical political candidate (McDermott 1998). At the same time, the vignette approach imposes a variety of limitations, such as pretreatment effects (real-world exposure to a treatment prior to participation in an experiment; Druckman and Leeper 2012) and confounds in treatment delivery (experimental conditions must be informationally equivalent with respect to the background features of the scenario; Dafoe et al. 2018). Here, we narrow our focus to concerns related to the delivery of experimental treatments and the measurement of their effects.

In terms of the delivery of experimental treatments, there are limits on the scope of topics that can be considered in this format. While vignettes can effectively mirror real-world situations in which individuals cannot avoid receiving information about some target event or politician, in the current environment forced exposure to information is rare, since individuals are able to make choices that determine both the content they receive and the platform on which it is delivered. In asking participants to take on the role of a mere observer, vignette treatments cannot address scenarios that involve interpersonal interaction (Sinclair 2012). Moreover, by minimizing participants' personal involvement in the experimental context, vignette designs encourage satisficing (Stolte 1994). Burying the treatment of interest within a longer narrative may also impede a respondent's ability to focus on a study's key manipulation (Mutz 2011, pp. 64–65; see also Chapter 12 in this volume). The uninvolving nature of the vignette design is likely to limit the strength of vignette-delivered treatments, an important consideration when studying particularly durable outcome variables.


After a treatment is delivered in this vignette format, the style of survey experiment we consider here measures its consequences using self-reported attitudes. At a theoretical level, attitudes represent wide-ranging psychological constructs with potential cognitive, affective, and behavioral implications for a person's future conduct (Eagly and Chaiken 1993). This means that any attitudinal change that occurs during a survey experiment, should it persist, may have broad implications for any subsequent political decision-making or opinion formation that draws upon the attitude examined in a study. More practically, attitudinal outcomes and other self-reports are easily incorporated into surveys, with a substantial body of work to guide researchers toward best practices in their use.

However, these benefits are accompanied by several concerns. Most importantly, do the attitudes measured in these studies meaningfully predict political behavior outside the experimental setting? Prior evidence suggests they often do not: audits of attitude–behavior consistency typically find that self-reported attitudes fail to translate into behavior (e.g., Bertrand and Mullainathan 2001; Schwarz 1999; Wicker 1969; for demonstrations of attitude–behavior correspondence, see Eagly and Chaiken 1993; Hainmueller et al. 2015), though even in these instances aggregate-level public opinion may remain important for political representation. Especially stark illustrations of the attitude–behavior gap come from studies of racial prejudice. In a classic study, LaPiere (1934) recruited a Chinese couple as research confederates and followed them across the USA to different restaurants and motels. In the overwhelming majority of cases, the couple obtained services – even though, in a subsequent survey, the business owners almost unanimously expressed an unwillingness to admit Chinese guests. To cite another discrepancy from a different setting, the level of voter turnout reported by respondents in the 2012 and 2016 American National Election Study surveys exceeded

the actual level of turnout by nearly 20 percentage points. Beyond this disconnect, the use of attitudes raises more subtle issues when attempting to understand political behavior. Responses to survey questions fluctuate in reaction to stimuli that have little relationship to the underlying attitude, and respondents sometimes intentionally distort the views they report in surveys. To illustrate, again with the subject of race relations, normative pressures often lead even the most bigoted individuals to respond in a race-neutral manner (Gough 1951). These motivated distortions of self-reports extend beyond racial issues. For instance, Bullock et al. (2015) and Prior et al. (2015) show that partisans may cheerlead for their side by knowingly providing incorrect answers to questions about the state of the national economy so as to portray their political party in a favorable light, in the process distorting the measurement of economic perceptions obtained from surveys. On a more technical front, when the survey platform itself highlights the relevant social norms – as may be the case when respondents are interviewed in person – responses conforming to social norms become more frequent. Conversely, in the more anonymous setting of an online self-administered survey, respondents are more willing to express out-group animus (for evidence of significant mode effects in the American National Election Study surveys, see Iyengar and Krupenkin 2018).

Setting aside these motivational concerns, and even assuming a best-case measurement scenario in which respondents attempt accurate reporting, it remains unclear whether individuals have sufficient access to their state of mind, or at least access that they can reasonably articulate in a survey setting, to allow them to report their true attitudes. Experiments on perception and memory demonstrate that much of the human brain's operation occurs outside conscious awareness, and that such unconscious thought and feeling may well be the dominant mode of operation (e.g., Bargh 1999). This insight has led much of the recent work in social psychology to adopt the notion that respondents cannot convey many important


attitudes because the underlying constructs are implicit in nature (Hassin et al. 2004). The preceding discussion addresses difficulties in attitude measurement that arise in any survey study, experimental or otherwise. But the limitations of self-reported attitudes are especially consequential for survey experimental research. In particular, they raise the very real possibility of false positives in survey experiments, as attitudes and other self-reports may be more malleable than behavior. By focusing primarily on attitudes, particularly less crystallized attitudes (e.g., Peterson and Simonovits 2018), survey experimentalists run the risk of overstating the impact of their interventions and the volatility of political behavior in response to stimuli. Consistent with this point, Webb and Sheeran (2018) conducted a meta-analysis spanning 47 experiments that included both attitudinal and behavioral outcomes. They reported that the magnitude of a treatment effect on attitudes was typically twice as large as its impact on behaviors. For political science research employing survey experiments, this reliance on attitudinal outcomes calls into question the behavioral implications of published findings, particularly in light of field-experimental evidence that voter persuasion efforts often fail to alter behavior (Kalla and Broockman 2018).
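To see why this attenuation matters in practice, consider a rough power calculation. The sketch below assumes, purely for illustration, a standardized attitudinal effect of 0.30 and a behavioral effect half that size (mirroring the ratio Webb and Sheeran report); the effect sizes are our assumptions, not estimates from any study discussed here. Halving the effect roughly quadruples the number of respondents needed to detect it.

# Illustrative calculation only: how much larger must a sample be to detect
# a behavioral effect that is half the size of the attitudinal effect?
# Two-sided test, alpha = 0.05, power = 0.80; effect sizes are assumed.
from scipy.stats import norm

def n_per_arm(d, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sample test of standardized effect d."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

for label, d in [("attitudinal outcome", 0.30), ("behavioral outcome", 0.15)]:
    print(f"{label}: d = {d:.2f}, ~{n_per_arm(d):.0f} respondents per arm")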

13.3 Incorporating Measures of Behavior into Survey Experiments

While these twin concerns about treatment delivery and outcome measurement may appear unrelated, in the remainder of this chapter we argue that both can be addressed with a common solution: the introduction of politically relevant behaviors into survey experiments. By "behavior" we mean direct observations of an experimental participant's choices, actions, or physiological responses. As our later examples make clear, there is a broad set of behavioral alternatives to the vignette-style treatments and self-reported attitudinal outcome measures previously discussed. In terms of treatment delivery, this behavioral approach differs from vignettes


because these treatments either stem from the actions of another individual or occur because the experimental participant takes actions that determine the treatment they receive. In terms of outcome measurement, the behavioral approach is distinctive in that researchers directly observe some action on the part of a respondent rather than probing attitudinal responses or asking individuals to self-report on what they would have done outside the survey setting. This call to incorporate behavior into survey experiments is no doubt unsurprising given that other forms of experimental work routinely incorporate behavior in this manner. Lab experimenters frequently use trust or dictator games that require subjects to reveal their preferences for cooperation in an incentivized setting (e.g., Berg et al. 1995; Carlin and Love 2018; Habyarimana et al. 2007). Lab experimenters interested in behavior also leverage the potential for research “confederates” to deliver treatments in real-world interactions (e.g., Asch 1951; Kuo et al. 2017). As a matter of course, field experiments focus on behaviors, such as using interpersonal contact to deliver treatments (Broockman and Kalla 2016) and measuring validated voter turnout as an outcome (Gerber et al. 2008). However, despite these examples, similar behavioral applications remain infrequent in survey experiments, a disconnect that guides our discussion here. Survey experiments also present some unique considerations for the incorporation of behavior compared to other experimental formats. In the remainder of the chapter, we discuss several approaches to inserting politically relevant behaviors into the survey experimental context. Some of these stem from mature research programs, while others are still in the pilot phase. All show promise and illustrate new directions for improving treatment delivery and outcome measurement in survey experiments.

13.4 Behavioral Treatments

In the standard paradigm, respondents encounter survey experimental treatments


through a text vignette. In this section, we discuss two cases of behaviorally inflected survey experimental treatments. In the first case, the behavioral approach helps bring greater ecological validity to the experimental setting. In the second instance, this behavioral approach can provide a new, more powerful means of delivering an experimental manipulation, enabling the consideration of new questions within the survey experimental context.

13.4.1 Consumer Choice and Treatment Delivery: Lessons from Media Effects Research

Media effects researchers face a conflicting set of considerations. Survey experiments facilitate randomized exposure to a news story or source, allowing researchers precise control over the messages an individual encounters and enabling robust estimates of any communication effects. However, this design forces individuals to encounter communications they may avoid in real-world scenarios. Forced exposure studies may therefore exaggerate the effects of political messages (e.g., Hovland 1959). To address this issue, a series of survey designs incorporates choice into the survey environment. Arceneaux and Johnson (2013), Levendusky (2013), and Messing and Westwood (2014) deploy interactive experimental settings that enable survey participants to initially choose the types of news content they would like to encounter. Because exposure to content conditions is randomly assigned after this initial content selection stage, the effects of the news exposure experiment that follows can be examined among those likely to encounter a certain type of content outside of the experimental context. The incorporation of choice yields important insights. For instance, Arceneaux and Johnson (2013) have survey respondents indicate the types of news content they would like to consume. By randomizing news exposure after they elicit these news content preferences, the authors show that the individuals most likely to encounter partisan media content are unlikely to be

polarized by this exposure (since they are already polarized), indicating that the media effects observed in prior studies occur among those with only a low propensity to select this content in real-world settings. On the other hand, Levendusky (2013) shows that even when accounting for selection into content type, partisan media can move the public's issue attitudes. These choice-based designs are instances of a broader class of patient preference-style treatments (see, e.g., De Benedictis-Kessner et al. 2019; Knox et al. 2019). To date, they have largely been employed in the study of media effects, but they have a number of potential extensions (e.g., choice of political discussion partners or between charitable organizations seeking volunteers). The incorporation of behavior into the experimental design only strengthens the ecological validity of the findings.

There are some potential limitations to this behavioral approach that may constrain its subsequent application. In the case of political communication research, it appears that, when offered choice, most participants opt for nonpolitical over political content, making it more difficult (or expensive, in the form of significantly larger samples) for researchers to observe the effects of interest. While the appropriate number of alternatives will likely vary based on the topic under consideration in a given study, some limits on the choice set may be necessary to avoid extensive loss of statistical power. For instance, De Benedictis-Kessner et al. (2019) use a stylized choice set consisting of two political options and one entertainment option to capture important aspects of media choice (i.e., the ability to avoid political content), while maintaining statistical power.
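To make the choice-then-randomize logic concrete, the sketch below first elicits a stated content preference, then randomly assigns exposure, and finally summarizes the exposure effect within each stated-preference group. The data, effect sizes, and category labels are simulated for illustration only; this is not the preference-incorporating estimator of De Benedictis-Kessner et al. (2019) or Knox et al. (2019), just the basic design logic.

# Minimal sketch of a choice-then-randomize design: respondents first state
# which type of content they would choose, exposure is then randomly
# assigned, and effects are summarized within each stated-preference group.
import numpy as np

rng = np.random.default_rng(1)
n = 3000

# Stage 1: stated preference among partisan news, mainstream news, entertainment
pref = rng.choice(["partisan", "mainstream", "entertainment"], size=n, p=[0.2, 0.3, 0.5])

# Stage 2: random assignment to partisan-news exposure (1) or placebo content (0)
exposed = rng.integers(0, 2, size=n)

# Simulated attitude outcome: exposure moves only those who would not choose partisan news
effect = np.where(pref == "partisan", 0.05, 0.40)
y = rng.normal(0, 1, size=n) + effect * exposed

for group in ["partisan", "mainstream", "entertainment"]:
    mask = pref == group
    diff = y[mask & (exposed == 1)].mean() - y[mask & (exposed == 0)].mean()
    print(f"would choose {group:13s}: estimated exposure effect = {diff:.2f}")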


13.4.2 Interactivity and Treatment Strength: Using Economic Games to Move Partisan Affect

Affective partisan polarization – the divide between the positive feelings of partisans toward the political party they identify with and their negative feelings toward the party they do not identify with – is the focus of much recent work in American political behavior (Iyengar et al. 2019; Mason 2018). While a number of survey experiments examine the properties of partisan affect (e.g., Klar et al. 2018; Lelkes and Westwood 2017), there is less research on a closely related question: How does partisan affect influence other political views? We believe that the absence of survey experimental work on this topic is attributable less to scholarly disinterest and more to the difficulty of manipulating partisan affect in a survey setting. In a number of survey experiments, partisan affect has proven resilient to vignette-style experimental treatments. For example, Levendusky (2018) estimates the effects of a national identity prime on partisan affect. Using a variety of well-powered studies that prime American identity in order to encourage individuals to consider their commonalities with members of the other political party, he detects only a slight decline in affective polarization due to these primes, ranging from 1 to 3 points on a feeling thermometer scale. Suhay et al. (2017) report the results of a vignette experiment in which individuals encountered offensive online comments from an opposing partisan. Here, the treatment effects proved conditional, appearing only among strong partisans. While both cases offer important illustrations of factors that shape partisan affect, the relatively weak treatments limit the researcher's ability to assess the consequences of any changes in partisan affect for other political outcomes in a "downstream" experimental design that requires a larger initial movement in the causal variable of interest (e.g., Green and Gerber 2002).4 In ongoing work, Westwood and Peterson (2019) introduce an alternative mechanism for shifting partisan affect in surveys. Instead of vignettes, they make use of behavioral games as a method of delivering experimental treatments. These games are commonly used to examine incentivized, behavioral

4 For example, a researcher might wish to study how varying levels of affective polarization affect responses to legislative action, likelihood to vote, or willingness to contribute to campaigns.


preferences for out-group cooperation by measuring the amount of money players allocate to others who differ from them on traits such as race, gender or partisanship (Berg et al. 1995; Carlin and Love 2018; Habyarimana et al. 2007). By using trust games to administer the treatment, the researchers can manipulate the valence of a participant’s interpersonal interaction with a member of the other political party. Participants are informed they are playing the game with an out-group member (i.e., someone from a different political party). They are then randomized into either a positively or negatively valenced outcome with the opposing player (i.e., receiving a generous monetary allocation versus receiving nothing). Westwood and Peterson (2019) have employed this approach on a series of large national samples. In contrast to the vignette-based studies, their results show substantial movement in partisan affect on both self-reported attitude measures (feeling thermometers) and behavioral outcomes in subsequent rounds of the games. Participants assigned to the positively valenced outcome express significantly less animus toward the out-party. Manipulation checks show that the treatments are in fact received, and that they do not appear to affect evaluations of groups unrelated to the treatment. The use of economic games is suggestive of the more potent consequences of behaviorally driven experimental treatments in surveys. This format is flexible and can likely be adapted to shift affect toward other social groups by inserting relevant information into the player profiles participants encounter. Most importantly, by creating a powerful manipulation, this procedure sets the stage for research into the downstream consequences of partisan affect within the survey experimental setting. And, while we view the use of economic games as an experimental platform as an exciting development, there are significant tradeoffs involving the costs and time required to train participants. Researchers must first provide detailed instructions and examples, followed by trial runs and comprehension questions – often taking up 10 minutes


of survey time. The costs of paying the allocations to respondents as a cash bonus are also nontrivial (between $10,000 and $15,000 for 2000 respondents). Finally, some might object to the use of mild deception in the administration of the treatment (it is necessary to contrive the profiles of the players with whom participants engage). Nonetheless, on balance, this design provides researchers with a potent treatment delivered through behavioral interaction rather than an information vignette.
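As a concrete sketch of how such a game-delivered treatment might be assigned and analyzed, the example below randomizes the valence of the allocation a participant ostensibly receives from an out-party player and compares post-game out-party feeling thermometers. The data, effect sizes, and variable names are simulated and hypothetical; this is not Westwood and Peterson's actual study code.

# Minimal sketch of a game-delivered treatment: participants ostensibly play
# a trust game with an out-party member and are randomized to a generous or
# stingy allocation; the outcome is a post-game out-party feeling thermometer.
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# 1 = generous allocation from the out-party player, 0 = nothing received
generous = rng.integers(0, 2, size=n)

# Simulated 0-100 out-party thermometer, warmer after a generous allocation
thermometer = np.clip(rng.normal(25 + 10 * generous, 15, size=n), 0, 100)

t1, t0 = thermometer[generous == 1], thermometer[generous == 0]
effect = t1.mean() - t0.mean()
se = np.sqrt(t1.var(ddof=1) / len(t1) + t0.var(ddof=1) / len(t0))
print(f"Generous vs. stingy allocation: +{effect:.1f} thermometer points (SE = {se:.1f})")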

13.5 Measurement: Beyond Stated Opinions

We now transition from considerations of treatment delivery to the measurement of treatment effects, particularly the incorporation of various forms of behavioral measures as outcomes in the survey experimental setting. This measurement strategy reverses the typical logic of attitude–behavior consistency and the resulting fixation on attitudinal outcomes. As a suggested strategy, we describe an array of behavioral indicators – all of which experimentalists can implement at relatively low cost – intended to shed light on observed rather than stated preferences. We note here that the measures we describe have yet to be routinely incorporated into survey experimental studies. Accordingly, we comment on their measurement properties and discuss their potential use as outcomes to be manipulated in some manner by a survey-delivered treatment.

13.5.1 Addressing Motivational Measurement Bias 1: Implicit Measures

One promising family of techniques to address social desirability bias focuses on observing attitudes that are not readily subject to considered cognition. Implicit attitudes are the "[t]races of past experience that mediate favorable or unfavorable feeling, thought, or action toward social objects" (Greenwald and Banaji 1995, p. 8). Unlike attitudinal self-reports where respondents can easily censor responses, implicit measures

assess preferences through response latency in a series of timed trials and are more difficult to game. The general argument is that implicit measures are more accurate than explicit measures since they do not permit active masking of feelings toward groups. Unobtrusive measures such as the Implicit Association Test (IAT) developed by Greenwald et al. (1998), the brief version (BIAT) developed by Sriram and Greenwald (2009), and the Affect Misattribution Procedure (AMP) developed by Payne et al. (2005) are much harder to manipulate than explicit self-reports, producing more valid and less biased results (Asendorpf et al. 2002; Boysen et al. 2006). The full IAT measures the reaction time necessary to associate in-groups and out-groups (such as "Democrat" and "Republican" or "African American" and "European American") with positive and negative attributes (such as "good" and "bad"). While completing the task, participants are instructed to go as quickly as possible. Since people are able to respond faster to group–attribute pairs for which they have acquired automatic associations, the metric of the IAT compares the time taken to respond to pairings of in-group + good with out-group + good as well as in-group + bad and out-group + bad. The differences in response times to the group pairings are used to generate an indirect measure of group preference (e.g., do people associate certain words more quickly with certain groups?). Ryan (2017) uses this design to show that response latency for independents is comparable to that of partisans. In other words, even though these citizens declare neutrality, they behave like partisans in implicit tasks. Valentino et al. (forthcoming) use this approach to capture implicit political and racial attitudes to determine the extent to which they are related, and they find a strong relationship between racial identity and partisan bias. Theodoridis (2017) uses this approach to assess the extent to which individuals tie themselves to parties and finds that individuals are strongly linked to political groups.


However, the full version of the IAT requires more than 15 minutes to administer and is, as a consequence, nearly always done in a lab setting with relatively small sample sizes. To counter these concerns, psychologists have developed (and validated) a brief version (BIAT), which measures the same associations, but with a reduced number of trials. In a BIAT, the participants complete four rounds of 20 timed categorizations, with the first pair of rounds treated as training and the last pair used for scoring the measure of implicit attitudes. The four blocks consist of two repetitions (randomly ordered) of the "in-group + good" block and the "out-group + good" block. In each block, the group not paired with good is grouped with negative words. In one example, a target stimulus is the Democratic mascot and the round pairs Democrats with good. Democratic respondents should more quickly categorize the mascot as "good," since they have come to associate "good" with Democrats. Conversely, Republican respondents should take more time to associate the Democratic mascot with "good." Iyengar and Westwood (2015) deploy a version of the BIAT that runs in browsers and allows for large-scale deployment to traditional survey panels with no significant software installation required. In the AMP, a target stimulus is displayed (a partisan symbol or a known political candidate) for a set period of time. This is then followed by the display of a meaningless symbol (e.g., Chinese characters). Participants classify the symbol as relatively pleasant or unpleasant, allowing for measurement of the prime's effect on the unvalenced symbols. The web version of the 2008 American National Election Study included the AMP (see, e.g., Segura and Valenzuela 2010), and it has been used by scholars to document racial bias in Whites' political attitudes. Messing et al. (2015) use these data to show that affective responses to Obama's image are conditional on the extent to which his skin has been darkened or lightened in images generated by the media.
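To show how such latency data become an attitude measure, the sketch below computes a standardized difference in mean response times between the two scored block types. It follows the spirit of the Greenwald et al. D-score but omits the trial trimming and error penalties used in published scoring algorithms; the latencies are simulated.

# Schematic scoring of a brief IAT from response latencies: compare mean
# response times in the "in-group + good" blocks with those in the
# "out-group + good" blocks, scaled by the pooled standard deviation.
import numpy as np

def biat_score(latencies_ingroup_good, latencies_outgroup_good):
    """Positive values indicate faster in-group + good responses (an in-group preference)."""
    a = np.asarray(latencies_ingroup_good, dtype=float)
    b = np.asarray(latencies_outgroup_good, dtype=float)
    pooled_sd = np.concatenate([a, b]).std(ddof=1)
    return (b.mean() - a.mean()) / pooled_sd

# Example with simulated latencies (milliseconds) for one respondent
rng = np.random.default_rng(3)
ingroup_good = rng.normal(650, 90, size=40)   # faster when the in-group is paired with "good"
outgroup_good = rng.normal(760, 90, size=40)  # slower when the out-group is paired with "good"
print(f"Implicit preference score: {biat_score(ingroup_good, outgroup_good):.2f}")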


There are a variety of measurement techniques to capture implicit attitudes, with several off-the-shelf versions available online. However, until recently (Iyengar et al. 2009; Iyengar and Westwood 2015), deploying customized versions of these tools required individuals to go to a campus lab (Greenwald et al. 1998) or interviewers to be equipped with a specialized device and the necessary software (Segura and Valenzuela 2010). Although implicit measures offer much promise for obtaining readings of group prejudice that are not subject to significant social desirability bias (for discussion of the validity of these approaches, see Asendorpf et al. 2002; Boysen et al. 2006; Iyengar and Westwood 2015), there are lingering questions over the reliability of implicit measures (Bosson et al. 2000; Cunningham et al. 2001). While it is more difficult to fake or manipulate IAT results, familiarity with the task makes some degree of manipulation possible (Steffens 2004). Nevertheless, the predictive validity of racial IATs is significantly higher than that of survey-based measures of prejudice (Greenwald et al. 2009; Iyengar and Westwood 2015). So while not a perfect measure, implicit designs are superior to survey questions for sensitive topics.

13.5.2 Addressing Motivational Measurement Bias 2: Behavioral Measures of News Choice

The perils of self-reporting apply not only to studies that focus on controversial or sensitive attitude targets, but also to those that focus on more innocuous subject matter. In the case of political communication, researchers have difficulty in measuring media exposure accurately, as survey respondents typically exaggerate their news consumption (Prior 2009, 2013; Vavreck et al. 2007). The most recent manifestation of this methodological weakness concerns the debate over partisan "echo chambers" and their potential role in polarizing the electorate. In the absence of reliable indicators of individuals' exposure to news sources featuring partisan slant, major survey organizations including the American National Election Study (ANES) and General Social Survey (GSS) have been


unable to shed light on basic questions of polarization springing from selective exposure to biased news. Fortunately, it is now possible to observe survey respondents’ media behaviors directly, thus bypassing the problems associated with self-reports. The diffusion of online surveys, based on large-scale panels of volunteer participants, has made it possible to merge survey data on self-reported media encounters with behavioral evidence of actual media use. These represent important outcomes for studies of the determinants of news choice, such as studies of partisan selective exposure in which media consumption is the primary outcome of interest, as well as studies that aim to use content consumption as a predictor of other political outcomes (e.g., polarization or turnout). The market research firm YouGov has recruited a subset of their panel to install an application that tracks their web browsing activities. Once installed, the toolbar collects the number of visits individuals make to different web domains and the particular web pages they visit at these domains. Peterson et al. (2019) tracked the web browsing behavior of a sample of YouGov panelists who had installed the toolbar between November 2016 and September 2017 (see also Guess 2018). Over this period, the panelists made 74 million visits to over 330,000 web domains. For the purposes of tracking exposure to news, they examined participants’ visits to 355 news domains. This list consists of the top 100 web domains for news based on overall traffic among their panelists and an additional 255 US-based websites included on the Alexa list of most popular news domains, including the websites of mainstream newspapers and television networks, web aggregators offering content from multiple sources, and other online-only sources of news and political commentary. Measures of web traffic based on the browsing behavior of this sample correlate strongly with other established measures of traffic. For instance, when Peterson et al. (2019) compare the level of traffic to the top 500 websites among their panelists with the browsing behavior of panelists in the Comscore national database, the correlation

for both aggregate traffic share and visits per panelist exceeds 0.9. When they limit this validation exercise to the 255 news sites, the convergence is weakened, but remains at acceptable levels (e.g., r = 0.67 between the two sources on the visits per panelist metric). The availability of browsing data makes clear the limitations of self-reports as measures of exposure to online news sources in some settings. For instance, in an ongoing panel study, these researchers were able to link actual visits to news sites with equivalent survey self-reports that asked respondents for the number of days over the previous week in which they visited foxnews.com or cnn.com and other major news sites. In the case of the self-reports, Republicans, on average, reported visiting Fox News on nearly two more days during the week than Democrats. In the case of the behavioral measure of web browsing over this same period, the partisan gap was cut by two-thirds, largely due to overreporting of exposure to Fox News by Republicans. The opposite pattern occurred for other media sites, with overreporting by Democrats leading to exaggerated partisan divides in self-reported visits to the websites of the Washington Post and New York Times. Altogether, this suggests that, beyond any fundamental tendency to overreport exposure to political news of all types (e.g., Guess 2015), directional measurement error in these self-reports of political news consumption exaggerates partisan divides in online news exposure.

This outcome measure offers a new possibility for experimental designs focusing on information search and news media choice as outcomes of interest. Researchers might randomly assign participants information about major online news domains (e.g., the partisanship of their audience, as in Weiss 2018) or prime the consideration of different motivations for information search (e.g., based on directional or accuracy motives). The effects of such treatments on subsequent patterns of news selection can then be assessed without the concerns attached to survey-based outcome measures of news choice.
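The comparison of self-reported and logged exposure amounts to a simple merge and group summary. The sketch below uses simulated data and hypothetical column names (party, selfreport_fox_days, logged_fox_days) and assumes the browsing log has already been aggregated to visit-days per respondent for a given domain; it is not the authors' actual analysis code.

# Minimal sketch of comparing self-reported and logged news exposure by party.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 1000

panel = pd.DataFrame({"party": rng.choice(["Democrat", "Republican"], size=n)})

# Simulated logged visit-days, plus over-reporting that is larger for Republicans
panel["logged_fox_days"] = rng.poisson(np.where(panel["party"] == "Republican", 1.0, 0.6))
panel["selfreport_fox_days"] = np.clip(
    panel["logged_fox_days"] + rng.poisson(np.where(panel["party"] == "Republican", 1.2, 0.3)), 0, 7
)

# Partisan gap (Republican minus Democrat) under each measure
means = panel.groupby("party")[["selfreport_fox_days", "logged_fox_days"]].mean()
print(means.round(2))
print("Self-report gap:", round(means.loc["Republican", "selfreport_fox_days"]
                                - means.loc["Democrat", "selfreport_fox_days"], 2))
print("Logged gap:     ", round(means.loc["Republican", "logged_fox_days"]
                                - means.loc["Democrat", "logged_fox_days"], 2))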


This approach is not without some limitations. The technique is obtrusive, and most survey respondents are unwilling to have their behavior tracked. It is also the case that those who consent to observation might vary from the population in important ways and that they may alter their web behavior because they are aware they are being monitored. Finally, researchers using traffic as a dependent measure must produce survey-based treatments that are powerful enough to shift largely habitual patterns of news exposure. While new treatments focusing on various motivations for news consumption and information search are needed, early considerations of web browsing as an outcome in survey experiments find news choice to be largely nonresponsive to the treatments delivered in these settings (see, e.g., Weiss 2018).

13.5.3 Difficulties in Introspection: Physiological Measures

A final behavioral substitute for self-reported attitudes comes from the physiological responses of participants as they engage with experimental or survey stimuli. Political scientists have a well-established interest in emotional arousal, dating back to the 1980s when the ANES first introduced measures of specific emotions directed at presidential candidates (e.g., Abelson et al. 1982). There has been a long-standing debate in psychology over individuals' ability to identify or accurately label their emotional state (Schachter and Singer 1962). Similarly, there is continuing disagreement over the extent to which survey respondents' expressions of hostility toward partisan opponents reflect genuine animus or relatively meaningless "cheerleading." For political scientists, an accurate measure of a respondent's emotional state gives leverage in understanding the effects of (negative) campaign advertising, communication from political elites, and the role of emotion in deliberation. Evidence showing correspondence between self-reported emotional arousal and physiological markers of arousal will help to resolve these controversies. In political contexts, researchers largely focus on the extent to which political content (news, advertisements, etc.) induces a


physiological response. The skin conductance response/galvanic skin response (a change in the conductivity of the skin) is typically used to assess levels of anxiety or stress responses (e.g., Balzer and Jacobs 2011; Renshon et al. 2015). Cortisol levels in bodily fluids (typically saliva) provide a marker of stress and are used to track the physiological impact of media coverage and election outcomes (e.g., Blanton et al. 2012; French et al. 2014). Changes in heart rate are also used to assess the extent to which political content induces a stress response (e.g., Grabe et al. 2003). While physiological indicators all offer robust behavioral measures of arousal, most require expensive lab equipment and the presence of the study participant in a physical lab. To date, therefore, the use of physiological data has been limited to very small samples. While it is not possible to track cortisol levels or galvanic skin responses outside of the lab, fitness tracking devices (e.g., Apple Watches, FitBits) make it possible to capture heart rate from online survey panels.5 Pulse rate is a well-documented indicator of arousal (see Lang et al. 1990; Schaaff and Adam 2013). Given the diffusion of "wearable" technology, it is now possible to synchronize the precise time at which respondents answer survey questions with a time-stamped record of their pulse rate recorded on personal devices outside of the lab. Westwood and Iyengar (2018) developed a method to connect online survey panels with fitness devices so as to capture physiological data. In effect, they developed an application that captures heart rate data from fitness trackers and merges these data with respondents' answers to survey questions. The merger of the physiological and survey data is possible because both sources are time stamped. In a proof-of-concept pretest, approximately 40 undergraduates at two institutions completed a 20-minute survey including

5 While some devices can capture skin conductivity, the underlying application programming interfaces are not yet publicly available. We anticipate that this will rapidly change in the years to come.


many questions taken from the ANES. After putting on a FitBit, participants were directed to minimize all physical activity other than necessary hand movements while completing the online survey. At the conclusion of the study, the researchers downloaded all participants' heart rate data from FitBit servers using an API – there was no need for the researchers to have access to the FitBit used in the study, which means this design could be rolled out to national online survey panels. Thus, for each participant, we can track the trajectory of their pulse rate in relation to the onset of particular survey questions. The pilot survey included the standard feeling thermometer battery, questions about the specific emotions elicited by Hillary Clinton and Donald Trump, and opinions about a variety of controversial issues. Because we were especially interested in how well physiological measures capture emotional arousal in response to survey content, we also asked respondents to indicate how they felt about a set of images and words known to evoke strong emotional responses (e.g., an image of a puppy, racial epithets). Our expectation was that arousal would peak during this section of the survey, and indeed this was the case. Participants' heart rates increased by an average of 53 beats per minute while exposed to the treatment. When confronted with stimuli related to Trump, we observed the highest heart rates among our respondents (approximately 112 beats per minute). Although suggestive, these results are limited because the sample was composed almost entirely of exceptionally fit (average resting heart rate of 61 beats per minute, which increases the possible range of treatment response) and liberal undergraduate students (approximately 6% of the sample identified as Republicans). While we were able to successfully link the data from the smart watches with participants' survey responses, heart rate is a relatively slow autonomic response. Changes in pulse rate are measurable over longer periods of time, but it is hard to observe differences in response to questions that follow one another

in close sequence. It is therefore necessary to carefully design a survey so as to embed target stimuli thought to be arousing in blocks that are not contiguous, but separated by blocks that provide non-arousing stimuli. It is also important to consider that many institutional review boards impose stricter review for studies that gather physiological data because of potential ethical concerns. Bearing these limitations in mind, there are numerous possibilities for incorporating these measures into future experimental work. First, they can serve as a less expensive means of collecting physiological outcomes for topics previously examined in lab-based experimental studies (e.g., the effects of political news exposure on emotional affect; Blanton et al. 2012). Second, these measures can be used to study additional topics (e.g., physiological responses to online political incivility) that have not been addressed in previous work.
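As an illustration of the time-stamp alignment on which this approach relies, the sketch below attaches the most recent heart rate reading at or before each survey response using a nearest-timestamp merge. The data frames and column names are hypothetical; this shows the basic join logic only, not the Westwood and Iyengar (2018) pipeline.

# Minimal sketch of aligning time-stamped wearable heart rate readings with
# the time at which each survey question was answered.
import pandas as pd

# Time-stamped heart rate readings exported from a fitness tracker API (illustrative values)
heart = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-03-01 10:00:00", "2021-03-01 10:00:30",
        "2021-03-01 10:01:00", "2021-03-01 10:01:30",
    ]),
    "bpm": [62, 64, 90, 71],
})

# Time stamps recorded by the survey platform when each item was answered
survey = pd.DataFrame({
    "question": ["feeling_thermometer_outparty", "arousing_image", "policy_item"],
    "answered_at": pd.to_datetime([
        "2021-03-01 10:00:20", "2021-03-01 10:01:05", "2021-03-01 10:01:40",
    ]),
})

# Attach the most recent heart rate reading at or before each answer time
merged = pd.merge_asof(
    survey.sort_values("answered_at"),
    heart.sort_values("timestamp"),
    left_on="answered_at",
    right_on="timestamp",
    direction="backward",
)
print(merged[["question", "answered_at", "bpm"]])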

13.6 Closing Discussion

The widespread application of survey experiments represents a significant disciplinary advance. These designs can be used to study a wide array of target populations and address causal questions across the discipline's subfields. They are also robust to concerns about sponsorship or demand effects, which have been issues for other forms of experimentation (Mummolo and Peterson 2019; White et al. 2018), though they can introduce additional ethical concerns. In addition, ongoing assessments indicate that survey experiments replicate well, even when moving between very different subject pools (Coppock et al. 2018; Mullinix et al. 2015). Despite these advancements, the field of survey experiments suffers from drawbacks associated with the use of vignette treatments and attitudinal outcomes. The former limits the power and scope of experimental treatments, while the latter calls into question the validity of observed treatment effects. As we have suggested in this chapter, both of these limitations can be addressed by incorporating behavioral interventions and


behavioral outcome measures. While the innovations offered here do not entirely answer fundamental questions concerning the generalizability of experimental findings (Barabas and Jerit 2010; Gaines et al. 2007), they enhance experimental realism and provide practical and low-cost alternatives to the standard vignette paradigm. In closing, it is important to note that the innovations we have described do not currently come with software applications that enable easy implementation. We recognize that we are proposing solutions that require a coding background, but the exponential growth of R and Python as core analytic languages makes this skill set increasingly common among political scientists. We anticipate that this level of training will rapidly diffuse as the next generation of experimentalists comes of age.

References

Aarøe, Lene, and Michael Bang Petersen. 2014. "Crowding Out Culture: Scandinavians and Americans Agree on Social Welfare in the Face of Deservingness Cues." Journal of Politics 76(3): 684–697. Abelson, Robert P., Donald R. Kinder, Mark D. Peters, and Susan T. Fiske. 1982. "Affective and Semantic Components in Political Person Perception." Journal of Personality and Social Psychology 42(4): 619. Alexander, Cheryl S., and Henry Jay Becker. 1978. "The Use of Vignettes in Survey Research." Public Opinion Quarterly 42(1): 93–104. Arceneaux, Kevin, and Martin Johnson. 2013. Changing Minds or Changing Channels?: Partisan News in an Age of Choice. Chicago, IL: University of Chicago Press. Asch, Solomon E. 1951. "Effects of Group Pressure Upon the Modification and Distortion of Judgment." In Groups, Leadership and Men, ed. Harold Guetzkow. Pittsburgh, PA: Carnegie Press, pp. 177–190. Asendorpf, Jens B., Rainer Banse, and Daniel Mücke. 2002. "Double Dissociation Between Implicit and Explicit Personality Self-concept: The Case of Shy Behavior." Journal of Personality and Social Psychology 83(2): 380. Balzer, Amanda, and Carly M. Jacobs. 2011. "Gender and Physiological Effects in Connecting


Disgust to Political Preferences.” Social Science Quarterly 92(5): 1297–1313. Barabas, Jason, and Jennifer Jerit. 2010. “Are Survey Experiments Externally Valid?” American Political Science Review 104(2): 226–242. Bargh, John A. 1999. “The Cognitive Monster: The Case Against the Controllability of Automatic Stereotype Effects.” In Dual-Process Theories in Social Psychology, eds. Shelly Chaiken, and Yaacov Trope. New York: Guilford Press, pp. 361–382. Berg, Joyce, John Dickhaut, and Kevin McCabe. 1995. “Trust, Reciprocity, and Social History.” Games and Economic Behavior 10(1): 122–142. Berinsky, Adam J., Gregory A. Huber, and Gabriel S. Lenz. 2012. “Evaluating Online Labor Markets for Experimental Research: Amazon.com’s Mechanical Turk.” Political Analysis 20(3): 351–368. Bertrand, Marianne, and Sendhil Mullainathan. 2001. “Do People Mean What They Say? Implications for Subjective Survey Data.” American Economic Review 91(2): 67–72. Blanton, Hart, Erin Strauts, and Marisol Perez. 2012. “Partisan Identification as a Predictor of Cortisol Response to Election News.” Political Communication 29(4): 447–460. Bosson, Jennifer K., William B. Swann Jr., and James W. Pennebaker. 2000. “Stalking the Perfect Measure of Implicit Self-esteem: The Blind Men and the Elephant Revisited?” Journal of Personality and Social Psychology 79(4): 631. Boysen, Guy A., David L. Vogel, and Stephanie Madon. 2006. “A Public Versus Private Administration of the Implicit Association Test.” European Journal of Social Psychology 36(6): 845–856. Broockman, David, and Joshua Kalla. 2016. “Durably Reducing Transphobia: A Field Experiment on Door-to-Door Canvassing.” Science 352(6282): 220–224. Bullock, John. 2011. “Elite Influence on Public Opinion in an Informed Electorate.” American Political Science Review 105(3): 496–515. Bullock, John G., Alan S. Gerber, Seth J. Hill, and Gregory A. Huber. 2015. “Partisan Bias in Factual Beliefs about Politics.” Quarterly Journal of Political Science 10(4): 519–578. Carlin, Ryan E., and Gregory J. Love. 2018. “Political Competition, Partisanship and Interpersonal Trust in Electoral Democracies.” British Journal of Political Science 48(1): 115–139. Coppock, Alexander, Kevin J. Mullinix, and Thomas J. Leeper. 2018. “Generalizability of Heterogeneous Treatment Effect Estimates Across Samples.” Proceedings of the National


Academy of Sciences of the United States of America 115(49): 12441–12446. Coppock, Alexander, and Oliver McClellan. 2019. "Validating the Demographic, Political, Psychological, and Experimental Results Obtained from a New Source of Online Survey Respondents." Research & Politics 6(1): 1–14. Cunningham, William A., Kristopher J. Preacher, and Mahzarin R. Banaji. 2001. "Implicit Attitude Measures: Consistency, Stability, and Convergent Validity." Psychological Science 12(2): 163–170. Dafoe, Allen, Baobao Zhang, and Devin Caughey. 2018. "Information Equivalence in Survey Experiments." Political Analysis 26(4): 399–416. De Benedictis-Kessner, Justin, Matthew A. Baum, Adam J. Berinsky, and Teppei Yamamoto. 2019. "Persuading the Enemy: Estimating the Persuasive Effects of Partisan Media with the Preference-Incorporating Choice and Assignment Design." American Political Science Review 113(4): 902–916. Druckman, James N., and Cindy D. Kam. 2011. "Students as Experimental Participants." In Cambridge Handbook of Experimental Political Science, eds. James N. Druckman, Donald P. Green, James Kuklinski, and Arthur Lupia. Cambridge, UK: Cambridge University Press, pp. 41–57. Druckman, James N., Donald P. Green, James Kuklinski, and Arthur Lupia. 2006. "The Growth and Development of Experimental Research in Political Science." American Political Science Review 100(4): 627–635. Druckman, James, and Thomas Leeper. 2012. "Learning More From Political Communication Experiments: Pretreatment and Its Effects." American Journal of Political Science 56(4): 875–896. Eagly, Alice H., and Shelly Chaiken. 1993. The Psychology of Attitudes. San Diego, CA: Harcourt Brace Jovanovich. French, Jeffrey A., Kevin B. Smith, John R. Alford, Adam Guck, Andrew K. Birnie, and John R. Hibbing. 2014. "Cortisol and Politics: Variance in Voting Behavior is Predicted by Baseline Cortisol Levels." Physiology & Behavior 133: 61–67. Gaines, Brian J., James H. Kuklinski, and Paul J. Quirk. 2007. "The Logic of the Survey Experiment Reexamined." Political Analysis 15(1): 1–20. Gerber, Alan S., Donald P. Green, and Christopher W. Larimer. 2008. "Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment." American Political Science Review 102(1): 33–48.

Gough, Harrison G. 1951. "Studies of Social Intolerance: III. Relationship of the Pr scale to Other Variables." Journal of Social Psychology 33(2): 257–262. Grabe, Maria Elizabeth, Annie Lang, and Xiaoquan Zhao. 2003. "News Content and Form: Implications for Memory and Audience Evaluations." Communication Research 30(4): 387–413. Green, Donald P., and Alan S. Gerber. 2002. "The Downstream Benefits of Experimentation." Political Analysis 10(4): 394–402. Greenwald, Anthony G., Debbie E. McGhee, and Jordan L. K. Schwartz. 1998. "Measuring Individual Differences in Implicit Cognition: The Implicit Association Test." Journal of Personality and Social Psychology 74(6): 1464. Greenwald, Anthony G., and Mahzarin R. Banaji. 1995. "Implicit Social Cognition: Attitudes, Self-esteem, and Stereotypes." Psychological Review 102(1): 4. Greenwald, Anthony G., T. Andrew Poehlman, Eric Luis Uhlmann, and Mahzarin R. Banaji. 2009. "Understanding and Using the Implicit Association Test: III. Meta-analysis of Predictive Validity." Journal of Personality and Social Psychology 97(1): 17. Guess, Andrew M. 2015. "Measure for Measure: An Experimental Test of Online Political Media Exposure." Political Analysis 23(1): 59–75. Guess, Andrew M. 2018. "(Almost) Everything in Moderation: New Evidence on Americans' Online Media Diets." Working paper. Habyarimana, James, Macartan Humphreys, Daniel N. Posner, and Jeremy M. Weinstein. 2007. "Why Does Ethnic Diversity Undermine Public Goods Provision?" American Political Science Review 101(4): 709–725. Hainmueller, Jens, Daniel J. Hopkins, and Teppei Yamamoto. 2014. "Causal Inference in Conjoint Analysis: Understanding Multidimensional Choices via Stated Preference Experiments." Political Analysis 22(1): 1–30. Hainmueller, Jens, Dominik Hangartner, and Teppei Yamamoto. 2015. "Validating Vignette and Conjoint Survey Experiments Against Real-World Behavior." Proceedings of the National Academy of Sciences of the United States of America 112(8): 2395–2400. Hassin, Ran R., James S. Uleman, and John A. Bargh. 2004. The New Unconscious. Oxford: Oxford University Press. Hovland, Carl I. 1959. "Reconciling Conflicting Results Derived from Experimental and Survey Studies of Attitude Change." American Psychologist 14(1): 8–17.

Beyond Attitudes Hyde, Susan D. 2015. “Experiments in International Relations: Lab, Survey and Field.” Annual Review of Political Science 18(1): 403–424. Iyengar, Shanto, Kyu Hahn, Christopher Dial, and Mahzarin R. Banaji. 2009. “Understanding explicit and implicit attitudes: A comparison of racial group and candidate preferences in the 2008 election.” In Conference Proceedings from the American Political Science Association. URL: http://pcl.stanford.edu/research/2010/iyengarunderstanding.pdf Iyengar, Shanto, and Masha Krupenkin. 2018. “The Strengthening of Partisan Affect.” Political Psychology 39(S1): 201–218. Iyengar, Shanto, and Sean J. Westwood. 2015. “Fear and Loathing Across Party Lines: New Evidence on Group Polarization.” American Journal of Political Science 59(3): 690–707. Iyengar, Shanto, Yphtach Lelkes, Matthew Levendusky, Neil Malhotra, and Sean J. Westwood. 2019. “The Origins and Consequences of Affective Polarization in the United States.” Annual Review of Political Science 22: 129–146. Kalla, Joshua, and David Broockman. 2018. “The Minimal Persuasive Effects of Campaign Contact in General Elections: Evidence from 49 Field Experiments.” American Political Science Review 112(1): 148–166. Klar, Samara, Yanna Krupnikov, and John Barry Ryan. 2018. “Affective Polarization or Partisan Disdain? Untangling a Dislike for the Opposing Party from a Dislike of Partisanship.” Public Opinion Quarterly 82(2): 379–390. Knox, Dean, Teppei Yamamoto, Matthew A. Baum, and Adam J. Berinsky. 2019. “Design, Identification, and Sensitivity Analysis for Patient Preference Trials.” Journal of the American Statistical Association 114: 1532–1546. Krosnick, Jon A. 1999. “Survey Research.” Annual Review of Psychology 50(1): 537–567. Kuo, Alexander, Neil Malhotra, and Cecilia Hyunjung Mo. 2017. “Social Exclusion and Political Identity: The Case of Asian American Partisanship.” Journal of Politics 79(1): 17–32. Lang, Peter J., Margaret M. Bradley, and Bruce N. Cuthbert. 1990. “Emotion, Attention, and the Startle Reflex.” Psychological Review 97(3): 377. LaPiere, Richard T. 1934. “Attitudes vs. Actions.” Social Forces 13(2): 230–237. Lelkes, Yphtach, and Rebecca Weiss. 2015. “Much Ado About Acquiescence: The Relative Validity and Reliability of Construct-Specific and Agree-Disagree Questions.” Research & Politics 2(3): 1–8. Lelkes, Yphtach, and Sean J. Westwood. 2017. “The Limits of Partisan Prejudice.” Journal of Politics 79(2): 485–501.


Levendusky, Matthew. 2013. “Partisan Media Exposure and Attitudes Toward the Opposition.” Political Communication 30(4): 565–581. Levendusky, Matthew. 2018. “Americans, Not Partisans: Can Priming American National Identity Reduce Affective Polarization?” Journal of Politics 80(1): 59–70. Mason, Lilliana. 2018. Uncivil Agreement: How Politics Became Our Identity. Chicago, IL: University of Chicago Press. McDermott, Monika L. 1998. “Race and Gender Cues in Low-Information Elections.” Political Research Quarterly 51(4): 895–918. McDermott, Rose. 2002. “Experimental Methods in Political Science.” Annual Review of Political Science 5(1): 31–61. Messing, Solomon, Maria Jabon, and Ethan Plaut. 2015. “Bias in the Flesh: Skin Complexion and Stereotype Consistency in Political Campaigns.” Public Opinion Quarterly 80(1): 44–65. Messing, Solomon, and Sean J. Westwood. 2014. “Selective Exposure in the Age of Social Media: Endorsements Trump Partisan Source Affiliation When Selecting News Online.” Communication Research 41(8): 1042–1063. Mondak, Jeffrey. 1993. “Public Opinion and Heuristic Processing of Source Cues.” Political Behavior 15(2): 167–192. Mullinix, Kevin J., Thomas J. Leeper, James N. Druckman, and Jeremy Freese. 2015. “The Generalizability of Survey Experiments.” Journal of Experimental Political Science 2(2): 109–138. Mummolo, Jonathan, and Erik Peterson. 2019. “Demand Effects in Survey Experiments: An Empirical Assessment.” American Political Science Review 113(2): 517–529. Mutz, Diana C. 2011. Population-Based Survey Experiments. Princeton, NJ: Princeton University Press. Nelson, Thomas E., Rosalee A. Clawson, and Zoe M. Oxley. 1997. “Media Framing of a Civil Liberties Conflict and its Effect on Tolerance.” American Political Science Review 91(3): 567–583. Nosanchuk, Terrance A. 1972. “The Vignette as an Experimental Approach to the Study of Social Status.” Social Science Research 1(1): 107–120. Payne, B. Keith, Clara Michelle Cheng, Olesya Govorun, and Brandon D. Stewart. 2005. “An Inkblot for Attitudes: Affect Misattribution as Implicit Measurement.” Journal of Personality and Social Psychology 89(3): 277. Peterson, Erik, and Gabor Simonovits. 2018. “The Electoral Consequences of Issue Frames.” Journal of Politics 80(4): 1283–1296. Peterson, Erik, Sharad Goel, and Shanto Iyengar. 2019. “Partisan Selective Exposure in


Online News Consumption: Evidence from the 2016 Presidential Campaign." Political Science Research and Methods. DOI: 10.1017/psrm.2019.55. Prior, Markus. 2009. "Improving Media Effects Research Through Better Measurement of News Exposure." Journal of Politics 71(3): 893–908. Prior, Markus. 2013. "Media and Political Polarization." Annual Review of Political Science 16: 101–127. Prior, Markus, Gaurav Sood, and Kabir Khanna. 2015. "You Cannot Be Serious: The Impact of Accuracy Incentives on Partisan Bias in Reports of Economic Perceptions." Quarterly Journal of Political Science 10(4): 489–518. Renshon, Jonathan, Jooa Julia Lee, and Dustin Tingley. 2015. "Physiological Arousal and Political Beliefs." Political Psychology 36(5): 569–585. Ryan, Timothy J. 2017. "How Do Indifferent Voters Decide? The Political Importance of Implicit Attitudes." American Journal of Political Science 61(4): 892–907. Schaaff, Kristina, and Marc T. P. Adam. 2013. "Measuring Emotional Arousal for Online Applications: Evaluation of Ultra-Short Term Heart Rate Variability Measures." In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. Piscataway, NJ: IEEE, pp. 362–368. Schachter, Stanley, and Jerome Singer. 1962. "Cognitive, Social, and Physiological Determinants of Emotional State." Psychological Review 69(5): 379. Schuman, Howard, and Stanley Presser. 1981. Questions and Answers in Attitude Surveys. Cambridge, MA: Academic Press. Schwarz, Norbert. 1999. "Self-Reports: How the Questions Shape the Answers." American Psychologist 54(2): 93–105. Segura, Gary M., and Ali A. Valenzuela. 2010. "Hope, Tropes, and Dopes: Hispanic and White Racial Animus in the 2008 Election." Presidential Studies Quarterly 40(3): 497–514. Sinclair, Betsy. 2012. The Social Citizen: Peer Networks and Political Behavior. Chicago, IL: University of Chicago Press. Sniderman, Paul M. 2011. "The Logic and Design of the Survey Experiment." In Cambridge Handbook of Experimental Political Science, eds. James N. Druckman, Donald P. Green, James Kuklinski, and Arthur Lupia. Cambridge, UK: Cambridge University Press, pp. 102–114. Sniderman, Paul M., and Douglas B. Grob. 1996. "Innovations in Experimental Design in

Attitude Surveys.” Annual Review of Sociology 22(1): 377–399. Sriram, Natarajan, and Anthony G. Greenwald. 2009. “The Brief Implicit Association Test.” Experimental Psychology 56(4): 283–294. Steffens, Melanie C. 2004. “Is the Implicit Association Test Immune to Faking?” Experimental Psychology 51(3): 165–179. Stolte, John F. 1994. “The Context of Satisficing in Vignette Research.” Journal of Social Psychology 134(6): 727–733. Suhay, Elizabeth, Emily Bello-Pardo, and Brianna Maurer. 2017. “The Polarizing Effects of Online Partisan Criticism: Evidence from Two Experiments.” International Journal of Press/Politics 23(1): 95–115. Theodoridis, Alexander G. 2017. “Me, Myself, and (I), (D), or (R)? Partisanship and Political Cognition through the Lens of Implicit Identity.” Journal of Politics 79(4): 1253–1267. Tomz, Michael R., and Jessica L. P. Weeks. 2013. “Public Opinion and the Democratic Peace.” American Political Science Review 107(4): 849–865. Valentino, Nicholas A. et al. Forthcoming. “Blue Is Black and Red is White? Affective Polarization and the Racialized Schemas of US Party Coalitions.” American Journal of Political Science. Vavreck, Lynn et al. 2007. “The exaggerated effects of advertising on turnout: The dangers of self-reports.” Quarterly Journal of Political Science 2(4): 325–343. Webb, Thomas L., and Paschal Sheeran. 2018. “Does Changing Behavioral Intentional Engender Behavior Change? A Meta-Analysis of the Experimental Evidence.” Psychological Bulletion 132(2): 249–268. Weiss, Rebecca. 2018. “Computational Social Science for Communication Research.” PhD thesis, Stanford University. Westwood, Sean, and Erik Peterson. 2019. “Compound Political Identity: How Partisan and Racial Identities Overlap and Reinforce.” Unpublished paper. White, Ariel, Anton Strezhnev, Christopher Lucas, Dominika Kruszewska, and Connor Huff. 2018. “Investigator Characteristics and Respondent Behavior in Online Surveys.” Journal of Experimental Political Science 5(1): 56–67. Wicker, Allan W. 1969. “Attitudes versus Actions: The Relationship of Verbal and Overt Behavioral Responses to Attitude Objects.” Journal of Social Issues 25(4): 41–78.

Part IV

EXPERIMENTAL ANALYSIS AND PRESENTATION

CHAPTER 14

Advances in Experimental Mediation Analysis∗

Adam N. Glynn

Abstract

Mediation analysis has been called “harder than it looks” (Bullock and Ha 2011) due to difficulties of experimental identification. However, recent work has clarified that, while hard, some experimental designs for mediation can be informative. Other recent work has provided “easier” substitutes for mediation analysis. This chapter has two goals. First, to summarize some of the findings published since Bullock and Ha (2011) and to consider the implications these findings have for mediation analysis. Second, to consider the situations in which a close alternative to mediation analysis would be useful (either as a supplement or a substitute). Such situations often depend on the motivation for the analysis.

* The author thanks Hsu Yumin Wang and Elisha Cohen for their research assistance as well as Teppei Yamamoto, John Bullock, James Druckman, Donald Green, and the participants of the 2019 book conference at Northwestern University for their helpful suggestions and comments.

14.1 Introduction

Mediation analysis is the study of the pathways by which a cause exerts its effect. Once we establish that a treatment or action A has an effect on an outcome Y, we may want to establish the portion of this effect that goes indirectly through a mediator M (the indirect effect) and the portion of the effect that does not go through M (the direct effect). A graphical depiction of these effects can be seen in Figure 14.1.

Figure 14.1 Graphical depiction of direct and indirect (through M) effects of A on Y. The indirect path runs from A to M (stage 1) and from M to Y (stage 2); the direct path runs from A to Y.

One recent example in political science is a survey experiment on public preferences for Supreme Court nominees (Acharya et al. 2018; Sen 2017). The studies find that nominee race A has an effect on preferences Y, such that Democratic respondents prefer Black nominees to white nominees. However, the studies also explore how much of this effect is due to party membership M that respondents impute due to race. That is, if respondents are informed that both nominees are Democrats, there

is little difference in support for white and Black nominees. In informal terms, we may wonder whether Democratic respondents really support Black nominees or whether they just support Black nominees because they think they are Democrats. The classical statistical approach to addressing this type of mediation question is often attributed to Duncan (1966) or Baron and Kenny (1986). In its simplest form, what is sometimes known as the Baron and Kenny difference method works in the following manner: if we have data on a treatment or action A, a potential mediating variable M, and an outcome Y , the total effect of A on Y can be estimated by the coefficient on A from the linear regression of Y on A (e.g., regressing preference on nominee race), the direct effect of A on Y can be estimated by the coefficient on A from the linear regression of Y on A and M (e.g., regressing nominee preference on nominee race with nominee party held constant), and the indirect effect of A on Y through M can be estimated by the difference between these coefficients. MacKinnon (2012) provides a textbook presentation of this classical approach. Problems with the classical approach were raised in Robins and Greenland (1992), which showed that the Baron and Kenny method would be biased in a wide variety of circumstances and provided an alternative method and a set of sufficient assumptions. Pearl (2001) revived the counterfactual analysis of the subject, providing a nonparametric “mediation formula” and an alternative identification criterion. Still, mediation analysis continued to proliferate in some fields without much consideration of these issues (Bullock et al. 2010). Within political science, Imai et al. (2011) provided a simulation-based nonparametric method, an alternative identification criterion, and a method for sensitivity analysis to deal

with unmeasured pretreatment confounders of the mediator–outcome relationship. For our purposes, this middle period of mediation analysis ends with the mediation chapter in the Cambridge Handbook of Experimental Political Science (Bullock and Ha 2011). That chapter emphasized the importance of experimental manipulation of the mediator and the difficulties that arise from heterogeneous effects. Since the mediator often cannot be randomized and effects are often heterogeneous, this chapter was aptly titled. Advances in mediation analysis since the publication of Imai et al. (2011) and Bullock and Ha (2011) provide a more hopeful and yet still complex situation. Within political science, Imai and Yamamoto (2013) provide a method of sensitivity analysis that allows for measured post-treatment confounders of the mediator–outcome relationship. Whereas the sensitivity analysis of Imai et al. (2011) was not robust to post-treatment confounders of the mediator and outcome (see figure 8 in the appendix of that paper), Imai and Yamamoto (2013) provide a sensitivity analysis that is robust to some measured post-treatment confounders of the mediator and outcome (although not to unmeasured post-treatment confounders).1

1 Within the mediation R package, the multimed function should be used instead of the mediate function to incorporate measured post-treatment confounders of the mediator–outcome relationship.

Imai et al. (2013) demonstrate that with some experimental designs, including designs where the mediator can only be indirectly manipulated, it is possible to bound indirect effects away from zero. Hence, in some cases, it is possible to experimentally establish the existence of mediated effects. Furthermore, Acharya et al. (2016) show that the eliminated effect (which has an interpretation related to indirect effects) may be a useful alternative target of estimation. Outside of political science, a number of different (and sometimes more general) identification criteria, methods of estimation, sensitivity analyses, alternative targets of estimation, and approaches to dealing with multiple mediators have been developed. This work is largely summarized in the VanderWeele (2015) textbook, and political


scientists considering mediation analysis are strongly advised to consult this book.

14.1.1 Chapter Goals and Motivating Examples

The goal of this chapter is to summarize some of the findings that have been published since Bullock and Ha (2011) and to consider their implications for experimental mediation analysis. One theme of this chapter is that these implications depend on the motivation for doing the mediation analysis. In some cases, the motivation will be sufficient to justify mediation analysis, even when sensitivity bounds are quite wide. In other cases, it may make sense to scale back one’s goals and focus on alternatives to mediation analysis. The following two illustrative stories demonstrate these points. As an illustrative story demonstrating the scaling back of goals, consider the Supreme Court example discussed above. Acharya et al. (2018) find that, while Democratic respondents prefer Black nominees to white nominees, when they are told that these nominees are Democrats, this preference decreases dramatically. This decrease is known as an eliminated effect, and while it is tempting to interpret this effect as evidence of an indirect effect (i.e., Black nominees are assumed to be Democrats, while white nominees are assumed to be Republicans), the authors note that mediated interaction is another possible interpretation. In this example, a mediated interaction would work in the following manner. Suppose that when not told about party membership, respondents assume that both Black and white nominees are Republicans (perhaps due to other information provided about the nominees). Further, suppose that Black Republicans are preferred to white Republicans. Finally, if respondents are informed that both nominees are Democrats, they have no preference for Black Democrats over white Democrats. Note that this mediated interaction interpretation is consistent with an eliminated effect (the effect of racial perception goes away once party membership is provided) but is not consistent with an indirect effect. In the absence of party information, the


respondent assumes that both the white and the Black nominees are Republicans, and therefore perceptions of race do not affect perceptions of party. However, even though this eliminated effect is not consistent with an indirect effect, it is still consistent with party membership being part of the explanation for the effect. As we will discuss later in detail, given a parallel experimental design, it is distinguishing between these two interpretations – indirect effect and mediated interaction – that makes mediation analysis “hard” relative to other kinds of analyses. In contrast to the Supreme Court story, consider a stylized example (adapted from Pearl 2001) that shows when mediation analysis may be justified, despite being hard. Suppose a new drug is shown to reduce blood pressure in a randomized controlled trial. However, it is noticed that a side effect of the drug (headache) may cause increased aspirin intake. It may be that aspirin is causing the reduction in blood pressure, not the proposed mechanisms of the drug. Suppose further that mediation analysis using either the Baron and Kenny method or the estimation methods described in Imai et al. (2011) indicates that the effect of the drug is entirely mediated by aspirin intake, although sensitivity analysis methods show that this might be due to unmeasured confounders of the aspirin–blood pressure relationship. Despite the uncertainty represented in the sensitivity analysis, the drug agency decides not to approve the drug because of the prima facie evidence from the estimates. The pharmaceutical company considers redesigning the drug to remove the side effects causing the aspirin intake, but such a redesign will be very costly, and the company is unsure of the risk. In order to be more sure, the company conducts another randomized trial where the drug and aspirin are jointly randomized. In this trial, the drug is shown to reduce blood pressure even after controlling for aspirin intake. Based on the results in VanderWeele (2011), this trial implies that some effect of the drug must go through a mediator other than aspirin. This information is enough for the company to take the risk on the drug redesign. The redesigned drug goes through randomized


controlled trials, is shown to reduce blood pressure, and is approved by the drug agency. There are three things to note about this drug story. First, all of the decisions – the decision not to approve, the decision to conduct a new experiment, the decision to redesign the drug, and the final decision to approve – are justifiable based on the costs and the fact that the indirect effect (and not a close alternative) represents the effect of the redesigned drug. Second, even an analysis with wide sensitivity bounds can inform a decision. The drug agency would potentially have approved the drug if the initial mediation estimates indicated that the drug had a large direct effect, not through aspirin. Third, despite the fact that redesigning the drug was the only way to get a precise estimate of the indirect effect, there was value to the company in first conducting an exploratory parallel experiment where the original drug and aspirin were jointly randomized before redesigning the drug. Exploratory mediation analysis may be worthwhile prior to making costly decisions.2

2 Often these costs will not be monetary (e.g., the costs of approving a drug with side effects).
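Because both the classical difference method and the simulation-based approach of Imai et al. (2011) recur throughout this chapter, the following R sketch shows what they look like in practice on simulated data loosely patterned on the drug–aspirin story. The data-generating process and all variable names are invented for illustration; this is a minimal sketch under those assumptions, not a template for applied work.

```r
# Hypothetical data-generating process loosely patterned on the drug-aspirin story.
set.seed(1)
n <- 1000
drug    <- rbinom(n, 1, 0.5)                              # randomized treatment A
aspirin <- 1.5 * drug + rnorm(n)                          # mediator M: side effect raises aspirin intake
bp      <- 120 - 2 * drug - 3 * aspirin + rnorm(n, 0, 5)  # outcome Y: blood pressure
dat <- data.frame(drug, aspirin, bp)

# Baron and Kenny difference method: total coefficient minus "direct" coefficient.
total    <- coef(lm(bp ~ drug, data = dat))[["drug"]]
direct   <- coef(lm(bp ~ drug + aspirin, data = dat))[["drug"]]
indirect_difference <- total - direct

# Simulation-based estimates (Imai et al. 2011) and sensitivity analysis from the mediation package.
library(mediation)
model.m <- lm(aspirin ~ drug, data = dat)
model.y <- lm(bp ~ drug + aspirin, data = dat)
med  <- mediate(model.m, model.y, treat = "drug", mediator = "aspirin", sims = 500)
summary(med)                          # reports ACME (indirect), ADE (direct), and total effect
sens <- medsens(med, rho.by = 0.1)    # sensitivity to unmeasured mediator-outcome confounding
summary(sens)
```

With only a treatment-only randomization simulated here, both sets of estimates lean on the untestable assumption that aspirin intake is unconfounded with blood pressure, which is precisely why a sensitivity analysis is reported alongside them.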

14.2 Definitions

14.2.1 Potential Outcomes

In order to formalize the discussion, we define a number of potential outcomes that we will subsequently use to define causal effects. With A binary (e.g., drug or placebo), we can define two potential outcomes: Y (A = 0), the outcome we would observe for the individual if A = 0 (e.g., the blood pressure we would observe for the individual if placebo is received), and Y (A = 1), the outcome we would observe for the individual if A = 1 (e.g., the blood pressure we would observe for the individual if the drug is received).3,4

3 It is straightforward to make A nonbinary, but the notation is complicated.

4 Note also that this definition assumes no interference between units (i.e., that one individual taking the drug does not affect the blood pressure of another individual). Chapter 16 in this volume discusses the relaxation of this assumption.

In order to simplify

notation moving forward, we will write Y (A = 0) = Y (0) and Y (A = 1) = Y (1), and we will assume that the observed outcome is equal to the corresponding observed potential outcome such that Y = A · Y (1) + (1 − A) · Y (0). If we further define the mediator as M (e.g., aspirin intake), we can define a number of other potential variables and effects. First, we can analogously define the treatment effect for the mediator in terms of two potential mediators: M(A = 0) the mediator we would observe for the individual if A = 0 (e.g., the aspirin intake we would observe for the individual if placebo is received) and M(A = 1) the mediator we would observe for the individual if A = 1 (e.g., the aspirin intake we would observe for the individual if the drug is received). Again, in order to simplify notation moving forward, we will write M(A = 0) = M(0) and M(A = 1) = M(1), and assume M = A · M(1)+ (1 − A) · M(0). Second, with a binary mediator, we can define potential outcomes with respect to the mediator: Y (M = 0) the outcome we would observe for the individual if M = 0 (e.g., the blood pressure we would observe for the individual if aspirin is not received) and Y (M = 1) the outcome we would observe for the individual if M = 1 (e.g., the blood pressure we would observe for the individual if aspirin is received). Although we cannot simplify this notation without creating confusion with Y (A = a) potential outcomes, we do assume that the observed outcome can be connected to these potential outcomes: Y = M · Y (M = 1) + (1 − M) · Y (M = 0). Third, we can define the joint potential outcomes of the form Y (am) : the outcome we would observe for the individual if A = a and M = m. With a binary mediator (e.g., aspirin or not), the notation simplifies to the following: Y (00) (e.g., the blood pressure we would observe for the individual if no drug and no aspirin are received), Y (10) (e.g., the blood pressure we would observe for the individual if the drug and no aspirin are received), Y (01) (e.g., the blood pressure we would observe for the individual if no drug and aspirin


are received), and Y (11) (e.g., the blood pressure we would observe for the individual if the drug and aspirin are received). With joint potential outcomes, the observed outcome is related in the following manner: Y = A · M · Y (11) + (1 − A) · M · Y (01) + A · (1 − M) · Y (10) + (1 − A) · (1 − M) · Y (00). Critically for mediation, we can also define cross-world potential outcomes of the form Y (aM(a′)): the outcome we would observe for the individual if A = a and M = M(a′). With a binary mediator, the notation simplifies to the following options: Y (0M(1)) (e.g., the blood pressure we would observe for the individual with no drug and aspirin intake as it would have been if the drug had been received) and Y (1M(0)) (e.g., the blood pressure we would observe for the individual if the drug had been taken and aspirin intake had been as it would have been if the drug had not been taken). Unlike the standard potential outcomes described above, these can never be observed without additional assumptions. This lack of observability is a fundamental problem of mediation analysis. Note that we can use the same notation to rewrite standard potential outcomes, such as Y (1) = Y (1M(1)) (e.g., the blood pressure we observe under treatment is the blood pressure we observe under treatment with aspirin intake as it would be under treatment) and Y (0) = Y (0M(0)) (e.g., the blood pressure we observe under control is the blood pressure we observe under control with aspirin intake as it would be under control). Finally, note that all of the above implies that the causal order of these variables is understood. That is, we assume that A can affect M, but that M does not affect A. We also assume that A and M can affect Y, but that Y does not affect A or M. In other terms, we assume that there is no reverse causality.

14.2.2 Total Effects, Joint Effects, and Effect Modification

We can define a number of causal effects as contrasts between the potential outcomes defined above. The most common of these


is the total effect of A on Y, defined as Y (1) − Y (0).5

5 It is also possible to define effects in terms of other contrasts, but differences represent the standard approach in political science.

In our running example, this effect represents the total effect of the drug on blood pressure. We can also define the total effects of A on M, M(A = 1) − M(A = 0), and the total effects of M on Y, Y (M = 1) − Y (M = 0). Similarly, we can define joint effects as contrasts between the four joint potential outcomes. For example, we might examine the joint effect of the drug and aspirin together, Y (11) − Y (00), or the comparative effect of the drug versus aspirin, Y (10) − Y (01). The Y (10) − Y (00) and Y (11) − Y (01) are known as controlled direct effects and are particularly important for our discussion on mediation. The first refers to the effect of the drug when the individual is forced to refrain from aspirin. The second refers to the effect of the drug when the individual is forced to take aspirin. There are two other joint effects that merit special mention: Y (01) − Y (00) and Y (11) − Y (10). The first of these is the effect of aspirin if the individual is forced to refrain from the drug. The second refers to the effect of aspirin when the individual is forced to take the drug. What makes these effects different from the other joint contrasts is that, because the causal direction runs from A to M (i.e., the aspirin variable M does not have a causal effect on the drug variable A), we can equate these joint effects to the single-treatment effects of M on Y for certain subsets of the population. For example, (Y (01) − Y (00)) · (1 − A) = (Y (M = 1) − Y (M = 0)) · (1 − A) and (Y (11) − Y (10)) · A = (Y (M = 1) − Y (M = 0)) · A. Chapter 15 in this volume addresses issues of such subpopulation or subgroup analysis.

14.2.3 Natural Direct and Indirect Effects

In order to define a potential outcomes-based mediation analysis, we decompose the total effects of A on Y in two ways by adding and subtracting the cross-world potential outcomes:


$$
\begin{aligned}
Y(1) - Y(0) &= Y(1M(1)) - Y(0M(0)) \\
&= \underbrace{Y(1M(1)) - Y(1M(0))}_{\text{indirect}} + \underbrace{Y(1M(0)) - Y(0M(0))}_{\text{direct}} && (14.1) \\
&= \underbrace{Y(1M(1)) - Y(0M(1))}_{\text{direct}} + \underbrace{Y(0M(1)) - Y(0M(0))}_{\text{indirect}} && (14.2)
\end{aligned}
$$

In both of these decompositions, we often refer to the direct and indirect effects as the natural direct and indirect effects (Pearl 2001), and which decomposition is of interest will often depend on context. For the blood pressure example, the indirect effect in Eq. (14.1) corresponds to the side effect of the drug on blood pressure if the drug has been received, and the direct effect corresponds to the effect of a redesigned drug that does not

cause headaches. Hence, it is this direct effect that is of primary policy interest.

14.2.4 Controlled versus Natural Direct Effects

It is easy to conflate controlled direct effects and natural direct effects. To understand the difference, it is helpful to rewrite one of the natural direct effects in terms of the controlled direct effects:

$$
\begin{aligned}
\underbrace{Y(1M(0)) - Y(0M(0))}_{\text{natural}}
&= \underbrace{(Y(10) - Y(00))}_{\text{controlled}} \cdot (1 - M(0)) + \underbrace{(Y(11) - Y(01))}_{\text{controlled}} \cdot M(0) && (14.3) \\
&= \underbrace{(Y(10) - Y(00))}_{\text{controlled}} + \underbrace{[(Y(11) - Y(01)) - (Y(10) - Y(00))]}_{\text{interaction}} \cdot M(0) && (14.4)
\end{aligned}
$$

Equation (14.3) shows that the natural direct effect of (14.1) is either equal to the Y (10) − Y (00) controlled direct effect or Y (11) − Y (01), depending on the value of M(0). For the blood pressure example, the natural direct effect is the effect of the redesigned drug, the first controlled direct effect is the effect of

the drug if the individual is forced to refrain from aspirin, and the second controlled direct effect is the effect of the drug if the individual is forced to take aspirin. We can also use this decomposition to define an eliminated effect (the difference between the total and controlled direct effect):

$$
\begin{aligned}
\underbrace{(Y(1) - Y(0))}_{\text{total}} - \underbrace{(Y(10) - Y(00))}_{\text{controlled}}
&= \underbrace{\underbrace{Y(1M(1)) - Y(1M(0))}_{\text{indirect}} + \underbrace{[(Y(11) - Y(01)) - (Y(10) - Y(00))] \cdot M(0)}_{\text{mediated interaction}}}_{\text{eliminated}} && (14.5) \\
&= \underbrace{\underbrace{Y(0M(1)) - Y(0M(0))}_{\text{indirect}} + \underbrace{[(Y(11) - Y(01)) - (Y(10) - Y(00))] \cdot M(1)}_{\text{mediated interaction}}}_{\text{eliminated}} && (14.6)
\end{aligned}
$$

In Eqs. (14.5) and (14.6), we see that the difference between the total effect and the controlled direct effect produces an indirect effect plus a mediated interaction. The

mediated interaction in Eq. (14.6) is salient for the stylized Supreme Court story told above. In that story, the Black nominee is preferred Y (1) = 1 to the white nominee


Y (0) = 0; however, if the respondent is informed that both are Democrats (M = 0), then both are supported and the respondent is indifferent Y (10) = Y (00) = 1. Hence, the difference between the total effect and the controlled direct effect, sometimes known as the eliminated effect, is positive (Y (1) − Y (0)) − (Y (10) − Y (00)) = 1. One possible interpretation of this difference is an indirect effect (see 14.6). For the indirect effect in (14.6), if M(1) = 0 and M(0) = 1, then the Black nominee is assumed to be a Democrat while the white nominee is assumed to be a Republican, such that Y (0M(1)) = Y (00) = 1 and Y (0M(0)) = Y (01) = 0. However, another possible interpretation is a mediated interaction. Suppose that both the Black and white nominees are assumed to be Republicans M(1) = M(0) = 1, which means that Y (0M(1)) − Y (0M(0)) = Y (01) − Y (01) = 0 and there is no indirect effect. In this case, the difference between the total and controlled direct effects exists because the respondent supports the Black Republican Y (11) = 1 but not the white Republican Y (01) = 0.
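The following R sketch encodes these two stylized respondents – one whose support pattern reflects a genuine indirect effect and one whose pattern reflects a mediated interaction – and verifies that both produce the same eliminated effect while decomposing it differently, per Eq. (14.6). The potential-outcome values are the hypothetical ones from the text, and eliminated_decomp is a hypothetical helper written only for this illustration.

```r
# Inputs: one respondent's joint potential outcomes Y(00), Y(10), Y(01), Y(11) and mediators M(0), M(1).
# Coding follows the text: A = 1 is the Black nominee, M = 1 means the nominee is seen as a Republican.
eliminated_decomp <- function(Y00, Y10, Y01, Y11, M0, M1) {
  Y1   <- ifelse(M1 == 1, Y11, Y10)   # Y(1) = Y(1M(1))
  Y0   <- ifelse(M0 == 1, Y01, Y00)   # Y(0) = Y(0M(0))
  Y0M1 <- ifelse(M1 == 1, Y01, Y00)   # cross-world outcome Y(0M(1))
  c(total           = Y1 - Y0,
    controlled      = Y10 - Y00,                             # CDE fixing M = 0 (told "Democrat")
    eliminated      = (Y1 - Y0) - (Y10 - Y00),
    indirect        = Y0M1 - Y0,                             # indirect term in Eq. (14.6)
    med_interaction = ((Y11 - Y01) - (Y10 - Y00)) * M1)      # interaction term in Eq. (14.6)
}

# Interpretation 1: indirect effect (Black nominee assumed Democrat, white nominee assumed Republican).
# Y(11) is immaterial here because M(1) = 0 zeroes out the interaction term.
eliminated_decomp(Y00 = 1, Y10 = 1, Y01 = 0, Y11 = 1, M0 = 1, M1 = 0)

# Interpretation 2: mediated interaction (both nominees assumed Republican).
eliminated_decomp(Y00 = 1, Y10 = 1, Y01 = 0, Y11 = 1, M0 = 1, M1 = 1)
```

Both calls return an eliminated effect of 1, but the first attributes it entirely to the indirect term and the second entirely to the mediated interaction, which is exactly the ambiguity discussed above.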

14.3 Average Effects, Experimental Identification, and Sensitivity Analysis

Instead of the individual-level effects defined above, we often focus on averages of these effects that are potentially identifiable by randomized experiment. However, due to


the complications of mediated interactions, average indirect effects are not identifiable by randomizing A and M. To see why, consider the role that linearity of expectations plays in the identification of average effects. Average total effects, such as E[Y (1) − Y (0)] can be identified by randomizing A, because the treated arm identifies E[Y (1)] and the control arm identifies E[Y (0)], while the linearity of expectations allows that E[Y (1)] − E[Y (0)] = E[Y (1)−Y (0)]. Additionally, jointly randomizing both A and M identifies the average joint potential outcomes E[Y (00)], E[Y (10)], E[Y (01)], and E[Y (11)], and the linearity of expectation again allows the combination of these into average controlled direct effects (e.g., E[Y (10)−Y (00)]) and even average interactions (E[(Y (11) − Y (01)) − (Y (10) − Y (00))]). Finally, randomizing A for a random subset of units and jointly randomizing A and M for a random subset of units allows the simultaneous identification of the average potential outcomes and average joint potential outcomes. This combined with linearity of expectations allows the identification of the eliminated effects (e.g., E[(Y (1) − Y (0)) − (Y (10) − Y (00))]). However, the average indirect effects (E[Y (1M(1)) − Y (1M(0))] and E[Y (0M(1)) − Y (0M(0))]) cannot be identified in this manner. One might hope to identify the average mediated interaction and subtract this out of the eliminated effect, but the average mediated interaction involves a product and hence cannot be identified by combination of average potential outcomes:

$$
E[((Y(11) - Y(01)) - (Y(10) - Y(00))) \cdot M(1)] \neq ((E[Y(11)] - E[Y(01)]) - (E[Y(10)] - E[Y(00)])) \cdot E[M(1)].
$$

Yet, despite this lack of identification for the average indirect effects and the average mediated interaction, it is clear that randomized experiments provide some information about these effects. Much of the recent literature has provided insight into the quantification of this information for various experimental designs and the assumptions that can be used to supplement these designs.
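As a quick numerical illustration of why the product term blocks identification, the R sketch below simulates joint potential outcomes in which the interaction is correlated with M(1); the average of the product then differs from the product of the averages, even though every marginal mean is identified by randomization. The data-generating process is invented purely for illustration.

```r
set.seed(2)
n <- 1e5
# Hypothetical individual-level joint potential outcomes with heterogeneity:
u   <- rbinom(n, 1, 0.5)              # latent type driving both M(1) and the interaction
M1  <- rbinom(n, 1, 0.2 + 0.6 * u)    # potential mediator under treatment
Y00 <- rnorm(n, 0)
Y10 <- Y00 + 1
Y01 <- Y00 + 0.5
Y11 <- Y00 + 1.5 + 2 * u              # interaction is larger for u = 1 types

interaction <- (Y11 - Y01) - (Y10 - Y00)

mean(interaction * M1)                # E[interaction * M(1)]: the average mediated interaction (about 0.8 here)
mean(interaction) * mean(M1)          # product of identified averages: a different quantity (about 0.5 here)
```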

14.3.1 Single-Experiment Design

The single-experiment design involves randomization of A, but not M. This means that the average effects of A on Y (E[Y (1) − Y (0)]) and A on M (E[M(1) − M(0)]) are identified, but the average effects based on average joint potential outcomes are not identified without additional assumptions. A number of assumptions have been proposed for the identification of direct and indirect effects


Figure 14.2 Directed acyclic graph (with nodes A, M, Y, and a confounder C) depicting the key identification criteria for the single experiment: (1) can the relevant C variables be measured to block the backdoor paths from M to Y; and (2) are those C variables unaffected by A? Unless the answer is yes to both of these questions, the indirect effect of A through M on Y cannot be identified from a single experimental design without additional assumptions.

under this design. Reviews of many of these can be found in TenHave and Joffe (2012), and VanderWeele (2015). The graphical criterion (Avin et al. 2005; Shpitser and VanderWeele 2011) is useful for intuition, and Pearl (2018) presents that criterion as the following: The variables A, M, and Y and potential control variables X can be modeled with a causal directed acyclic graph, and there exists a subset C of X , possibly an empty set. If C blocks all backdoor paths from M to Y , then the average joint effects, including controlled direct effects, of A and M on Y can be identified (e.g., E[Y (10) − Y (00)]). However, in order to additionally identify the natural direct and indirect effects, it must also be true that no member of C is a descendent of A. An informal and incomplete version of this criterion is that C variables that affect both the M and Y variables are measured, and that none of these variables is affected by A. This situation is depicted in Figure 14.2, where if the C variables are measured and adjusted for, then controlled direct effects can be identified, but if A affects C (the dashed arrow in Figure 14.2), then the natural direct and indirect effects of A through M on Y cannot be identified. Fortunately, when it is not known whether all C variables have been measured and it is not known whether A affects these variables, Imai et al. (2013) provide sharp bounds on these natural effects for binary outcomes, and Tchetgen and Shpitser (2012)

and VanderWeele and Chiba (2014) provide sensitivity analyses for nonbinary outcomes. When all C variables that have been affected by A have been measured, VanderWeele (2012) and Imai and Yamamoto (2013) provide methods for sensitivity analysis. Unfortunately, bounds and sensitivity analysis may not be too informative for this design. For bounding with a binary outcome, Imai et al. (2013) show that this design will always produce bounds containing zero (and hence fail to rule out the absence of an average indirect effect).

14.3.2 Parallel Experiment Design

The parallel experiment design described in Imai et al. (2013) involves (1) randomly splitting the sample into two groups, (2) randomizing A for the first group, and (3) jointly randomizing A and M for the second group. As with the single-experiment design, the randomization of A in the first group identifies the average effects of A on Y (E[Y (1) − Y (0)]) and A on M (E[M(1) − M(0)]). The randomization of A and M for the second group identifies the average joint effects of A and M on Y. For example, the average controlled direct effects (E[Y (10) − Y (00)] and E[Y (11) − Y (01)]) are identified by this group. As discussed above, the identification of these effects allows the identification of the average eliminated effect but not the average indirect effect. However, while the parallel design does not


point identify the indirect effect, it does provide a fair amount of information. This has been formalized in Imai et al. (2013), who demonstrate that, with binary outcomes, the bounds on the average indirect effect produced by this experimental design will sometimes not include zero.

14.3.3 Parallel Encouragement Design

Sometimes we may only be able to indirectly manipulate the mediator M through the use of an instrument. For example, we may not be able to force individuals to take aspirin or to refrain from taking aspirin, but we may be able to encourage them to take or not take aspirin. In this case, we would not be able to use the parallel experiment design, but we would be able to use a parallel encouragement design where (1) the sample is randomly split into two groups, (2) A is randomized for the first group, and (3) A and the instrument are jointly randomized for the second group. Imai et al. (2013) show that even with this weaker design, when an exclusion restriction holds, it is possible to get bounds on the average indirect effect that do not include zero. Furthermore, they also show that bounds will be tighter on the average complier indirect effects (where compliers are those that would always do what they are encouraged to do) than for the average indirect effects.

14.3.4 Path-Severing Experiment Design

The final experiment to consider is the so-called path-severing experiment (Pearl 2001). Instead of manipulating a variable, this type of experiment attempts to manipulate the causal path from A to M. In the blood pressure example, this would be a single-experiment design on the redesigned drug (because the redesigned drug eliminates the causal path from the drug to aspirin intake). This type of analysis may also be referred to as implicit mediation analysis (Gerber and Green 2012). If we are willing to assume that the redesigned treatment has eliminated the effect of A on M, then this experiment will point identify the natural direct effects and, in combination with the single experiment or


parallel experiments, will identify the natural indirect effect. An alternative approach to a “path-severing” design has been proposed in Yamamoto and Yeager (2019). In this approach, the path from A to M may not exist for certain individuals (which may be described by certain values of an effect modifier). For example, in our blood pressure example, suppose there are individuals who are known to be allergic to aspirin, and hence would only take painkillers that would not reduce blood pressure. In this situation, aspirin allergy functions as a “switch” in the terminology of the authors that turns off the path from A to M and that might allow us to identify the natural direct effect of A on Y (at least for a subpopulation).
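Before turning to more advanced topics, the following R sketch shows how the quantities identified by a parallel design can be estimated in practice: the total effect comes from the arm in which only A is randomized, the controlled direct effects come from the jointly randomized arm, and the eliminated effect is their difference. The design and data-generating process are hypothetical, and uncertainty estimates (e.g., via the bootstrap) are omitted for brevity.

```r
set.seed(3)
n <- 4000
group <- rbinom(n, 1, 0.5)                     # 0 = A randomized only; 1 = A and M jointly randomized
A     <- rbinom(n, 1, 0.5)
M_nat <- rbinom(n, 1, plogis(-1 + 1.5 * A))    # mediator as it arises naturally
M_set <- rbinom(n, 1, 0.5)                     # mediator value assigned in the joint arm
M     <- ifelse(group == 1, M_set, M_nat)
Y     <- 1 * A + 2 * M + 0.5 * A * M + rnorm(n)   # hypothetical outcome model

d1 <- data.frame(A, M, Y)[group == 0, ]        # single-experiment arm
d2 <- data.frame(A, M, Y)[group == 1, ]        # parallel (jointly randomized) arm

total <- coef(lm(Y ~ A, data = d1))[["A"]]                               # estimates E[Y(1) - Y(0)]
cde0  <- with(d2, mean(Y[A == 1 & M == 0]) - mean(Y[A == 0 & M == 0]))   # estimates E[Y(10) - Y(00)]
cde1  <- with(d2, mean(Y[A == 1 & M == 1]) - mean(Y[A == 0 & M == 1]))   # estimates E[Y(11) - Y(01)]
eliminated <- total - cde0                     # identified; the average indirect effect is not

c(total = total, cde0 = cde0, cde1 = cde1, eliminated = eliminated)
```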

14.4 Advanced Topics and Further Reading

While the previous sections have focused on the motivations for mediation and the nuances of identification, large literatures have developed on a number of specialized topics. While we do not have the space to discuss these topics in detail, we present a short description as well as citations for further reading. That said, in all cases, these topics are covered by VanderWeele (2015), so interested researchers should start with that book. There are also some topics covered by VanderWeele (2015) – mediation for survival analysis (ch. 4) and time-varying exposures and mediators (ch. 6) – that may be of interest to some readers but have, to our knowledge, not been the subject of experimental analysis within political science to date. We do not discuss these topics here.

14.4.1 Estimation

In all of the above discussion, we assumed that estimation was carried out using the simulation methods of the mediation R package. However, there are now a wide variety of alternative estimation techniques: the Baron and Kenny methods discussed


above, more flexible techniques based on regression (VanderWeele 2015, chs. 1–6), inverse probability weighting (IPW) (Huber 2014), semi-parametric estimators based on the influence function (Tchetgen and Shpitser 2012), and targeted maximum likelihood estimators (Zheng and van der Laan 2012). The more recent estimators have some advantages. For example, the estimators that model treatment but not mediators (e.g., the IPW estimator) may be easier to use with large numbers of mediators, and also limit the uncertainty associated with modeling all of the mediators. Also, the influence function and targeted maximum likelihood estimators have nice large-sample properties such as multiple robustness (the models for treatment, mediator, and outcome need not all be correct for consistent estimation) and semi-parametric efficiency. However, in sample sizes between 1000 and 4000 with a single mediator, the Huber et al. (2016) simulation analysis shows that the methods of the mediation R package perform quite well. Of course, future simulation studies may produce different results, but for the typical political science experiment with a small number of mediators, simulation methods may be appropriate.

14.4.2 Multiple Mediators and Multiple Measures of the Mediator

In the discussion above, we have treated the possibility of additional mediators as a nuisance. For example, a post-treatment confounder of the mediator–outcome relationship is itself a mediator. In some studies, additional mediators may be a subject of interest, not just a source of confounding. In our opening Supreme Court example we focused on the party mechanism, but there are a number of other potential mediators of interest. For example, the preference of Democratic respondents for Black nominees may be due to ideological differences beyond party. Imai and Yamamoto (2013) and VanderWeele (2015) provide methods for this type of analysis, and Avin et al. (2005) and Shpitser (2013) provide graphical

criteria for the identification of path-specific effects. It is also important to note that by changing the question of interest, one can sometimes avoid the problems associated with a post-treatment measured confounder of the mediator–outcome relationship (as in Figure 14.2 if the arrow from A to C is present) (Avin et al. 2005; VanderWeele et al. 2014). This is discussed in more detail below. In addition to multiple mediators, we may at times have multiple measures of mediators. For example, there are a number of potential measures of ideology. This situation dramatically increases the complications of the analysis, as potential confounders and sensitivity analyses must be related to all possible measures (VanderWeele 2015, ch. 7), and even the effect of interest may be uncertain. One could also use the multiple measures to estimate a latent variable (if that fits the theory), but then the measurement error in the latent variable will need to be accommodated (VanderWeele 2015, ch. 7). Note that within the experimental context, a focus on mediators that can be manipulated will avoid some of these issues.

14.4.3 Alternative “Indirect” Effects

We have already discussed the possibility of targeting the eliminated effect (i.e., a combination of the indirect effect and a mediated interaction) as an easier alternative to targeting the indirect effect. Much recent literature has been devoted to the proposal of other close alternatives to the natural indirect effects in the presence of a post-treatment measured confounder of the mediator–outcome relationship (as in Figure 14.2 if the arrow from A to C is present). VanderWeele et al. (2014) summarize three quantities that are closely related to natural indirect effects but that can be estimated with post-treatment confounders. The first quantity is simply to redefine the indirect effect to include all potential post-treatment mediators. In the Supreme Court example, we might redefine the mediator to be vector valued and include both party and ideology. If there are no other left out C


variables, then this new quantity would not suffer from a post-treatment confounder. The second, related quantity is a path-specific effect. In Figure 14.2, if A affects C and C affects M, it will not be possible to identify all of the indirect effects through M. However, Avin et al. (2005) show that if all such C variables can be measured, it is possible to identify certain effects, such as the effect specific to the A → M → Y path (without including the A → C → M → Y portion of the indirect effect). In the Supreme Court example, if C is partisanship and M is ideology and we assume the absence of other missing variables, we might identify the effect of nominee race through ideology, but not partisanship. Finally, a number of authors have proposed indirect effects based on assigning a random value to the mediator instead of a fixed value based on an individual potential outcome (Didelez et al. 2012; Geneletti 2007; VanderWeele et al. 2014; Vansteelandt and Daniel 2017).

14.5 Why Do a Mediation Analysis?

As discussed above, the success or failure of a mediation analysis can depend on the reasons for doing one. Put simply, mediation analysis is hard, so the researcher may want to consider whether the natural indirect and direct effects are really the only quantities of interest or whether an alternative quantity might suffice (or supplement). VanderWeele (2015) summarizes seven motivations that have been given for conducting mediation analysis.

14.5.1 Motivations

(1) Scientific understanding
(2) Confirm or refute a theory
(3) Refine an intervention
(4) Discarding components
(5) Determine reason for no apparent total effect
(6) Not being able to intervene directly
(7) Bolster claim of effect

Knowledge of direct and indirect effects is certainly important for scientific understanding (the first motivation), but


supplementing a mediation analysis can also help with understanding. In the Supreme Court example, it was supposed that the effect of race on support for nominees was due to respondent assumptions about the partisanship of the nominees. This is true even if we cannot distinguish between the indirect effect (respondents assumed that Black nominees were Democrats and white nominees were Republicans) and the mediated interaction (respondents assumed that both nominees were Republicans but were willing to overlook this among Black Republicans). One can make a claim that scientific understanding has advanced without overclaiming that an indirect effect has been identified. Additionally, the presentation of an eliminated effect analysis prior to the presentation of a mediation analysis (preferably along with a sensitivity analysis) can aid in understanding by clarifying the source of the uncertainty for the indirect effect. The second motivation – confirm or refute a theory – may or may not require mediation analysis. If the single-experiment design for the blood pressure example had produced an estimated average indirect effect of zero, then the drug agency may have decided to approve the drug on the basis of the mediation analysis. In fact, if the treatment does not appear to have an effect on the mediator and negative effects can be ruled out by theory, then a regression of the mediator on the treatment may be sufficient. Also, if with the joint randomization of the drug and aspirin the drug is shown to reduce blood pressure even after controlling for aspirin intake, then this implies that some effect of the drug must go through a mediator other than aspirin (and we can rule out the theory that the entirety of the effect goes through aspirin). Hence, a full-fledged mediation analysis may not be necessary to refute the theory that the effect goes entirely through aspirin. The third and fourth motivations – refinement of a treatment and discarding components – do seem to often necessitate mediation. The third motivation corresponds to the scurvy example from Gerber and Green (2012). In that case, learning that limes


prevent scurvy by providing vitamin C allows the refinement of treatment (bring vitamin C on the ship instead of lots of limes). The fourth motivation matches the blood pressure example in that we would want to discard the headache-causing component of the drug. The fifth motivation – determining the reason for no apparent total effect – would sometimes require mediation (e.g., if we suspect that positive indirect effects are canceling with negative direct effects). However, if the goal is just to determine why an experiment did not work, a simple manipulation check may be sufficient (Gerber and Green 2012). The sixth motivation – that we are not able to intervene directly – may be a sufficient reason. However, this motivation will typically correspond to nonexperimental mediation analysis because, in experimental analysis, the treatment can usually be manipulated. Finally, VanderWeele (2015) notes that the seventh motivation – to bolster the claim of an effect – does not seem well founded outside of the basic sciences. This seems especially true when the treatment can be experimentally randomized.

14.5.2 Decisions

In addition to considering the above motivations, it can be helpful to consider whether the motivation for a mediation analysis is connected to any specific decisions that need to be made. Some motivations may not be tied to immediate decisions (e.g., scientific understanding), but others will. The blood pressure example highlights three such decisions: (1) the decision to approve a treatment or recommend a policy; (2) the decision to conduct an additional experiment; and (3) the decision to design a new treatment or redesign an old treatment. The motivation to refine an intervention or discard a component may be tied to the third type of decision (to design a new treatment or redesign an old treatment). For these decisions, costs play an important role, and unless there is provisional evidence that the new/redesigned treatment might be effective, it may not be worth the cost.

A number of the aforementioned motivations may be tied to the second type of decision (to conduct an additional experiment). This situation can occur when a mediation analysis was not originally planned, so a single-experiment design was used, but subsequent analysis of secondary outcomes (i.e., potential M variables) implies that mediation may be an important consideration. In this case again, it may not be worth conducting an additional parallel experiment unless the provisional evidence from the single experiment seems suggestive. Finally, the first type of decision (to recommend a treatment or policy) seems somewhat distinct from the motivations listed above. If mediation analysis based on the single-experiment design provides provisional evidence that a treatment appears to be working through an unanticipated mechanism, then this can affect the decision to approve or not to approve the treatment/policy. Such decisions may be quite complicated, as they will depend on costs (e.g., headaches) and the availability of alternative treatments (e.g., aspirin).

14.6 Conclusion

This chapter has summarized some recent findings on mediation analysis and discussed the implications of some of these findings for practice. One important finding has been the limited information inherent in the single-experiment design. Bounds from this design always contain an average indirect effect of zero, and the most robust version of sensitivity analysis using this design seems to typically be unable to rule out an average effect of zero. Therefore, the single-experiment design should be used only when limited information is sufficient, mediators cannot be manipulated (either directly or indirectly), or when we can measure the confounders of the mediator–outcome relationship. As discussed, limited information may be sufficient for certain types of decisions. In contrast, parallel designs have been shown to provide more substantial information, with even the


parallel encouragement design providing bounds that rule out zero in some contexts. When possible, parallel designs should be used. Finally, the reported motivations for mediation analysis have been discussed, and it was noted that some motivations would imply that alternative approaches, such as the eliminated effect, should be used to supplement (or maybe substitute for) a mediation analysis.

References

Acharya, Avidit, Matthew Blackwell, and Maya Sen. 2016. “Explaining causal findings without bias: Detecting and assessing direct effects.” American Political Science Review 110(3): 512–529.
Acharya, Avidit, Matthew Blackwell, and Maya Sen. 2018. “Analyzing causal mechanisms in survey experiments.” Political Analysis 26: 357–378.
Avin, Chen, Ilya Shpitser, and Judea Pearl. 2005. Identifiability of path-specific effects. In Proceedings of the 19th International Joint Conference on Artificial Intelligence. Burlington, MA: Morgan Kaufmann Publishers, Inc., pp. 357–363.
Baron, Reuben M. and David A. Kenny. 1986. “The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations.” Journal of Personality and Social Psychology 51(6): 1173.
Bullock, John G., Donald P. Green, and Shang E. Ha. 2010. “Yes, but what’s the mechanism? (Don’t expect an easy answer).” Journal of Personality and Social Psychology 98(4): 550.
Bullock, John G. and Shang E. Ha. 2011. “Mediation Analysis Is Harder than It Looks.” In Cambridge Handbook of Experimental Political Science. Cambridge, UK: Cambridge University Press, p. 959.
Didelez, Vanessa, Philip Dawid, and Sara Geneletti. 2012. “Direct and indirect effects of sequential treatments.” arXiv preprint arXiv:1206.6840.
Duncan, Otis Dudley. 1966. “Path analysis: Sociological examples.” American Journal of Sociology 72(1): 1–16.
Geneletti, Sara. 2007. “Identifying direct and indirect effects in a non-counterfactual framework.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(2): 199–215.


Gerber, Alan S. and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: W. W. Norton.
Huber, Martin. 2014. “Identifying causal mechanisms (primarily) based on inverse probability weighting.” Journal of Applied Econometrics 29(6): 920–943.
Huber, Martin, Michael Lechner, and Giovanni Mellace. 2016. “The finite sample performance of estimators for mediation analysis under sequential conditional independence.” Journal of Business & Economic Statistics 34(1): 139–160.
Imai, Kosuke, Dustin Tingley, and Teppei Yamamoto. 2013. “Experimental designs for identifying causal mechanisms.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 176(1): 5–51.
Imai, Kosuke, Luke Keele, Dustin Tingley, and Teppei Yamamoto. 2011. “Unpacking the black box of causality: Learning about causal mechanisms from experimental and observational studies.” American Political Science Review 105(4): 765–789.
Imai, Kosuke, and Teppei Yamamoto. 2013. “Identification and sensitivity analysis for multiple causal mechanisms: Revisiting evidence from framing experiments.” Political Analysis 21(2): 141–171.
MacKinnon, David. 2012. Introduction to Statistical Mediation Analysis. Abingdon: Routledge.
Pearl, Judea. 2001. “Direct and indirect effects.” In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, eds. J. S. Breese and D. Koller. Burlington, MA: Morgan Kaufmann Publishers, Inc., pp. 411–420.
Pearl, Judea. 2018. “Causal and Counterfactual Inference.” In The Handbook of Rationality. Cambridge, MA: MIT Press, pp. 1–41.
Robins, James M., and Sander Greenland. 1992. “Identifiability and exchangeability for direct and indirect effects.” Epidemiology 3: 143–145.
Sen, Maya. 2017. “How political signals affect public support for judicial nominations: evidence from a conjoint experiment.” Political Research Quarterly 70(2): 374–393.
Shpitser, Ilya. 2013. “Counterfactual graphical models for longitudinal mediation analysis with unobserved confounding.” Cognitive Science 37(6): 1011–1035.
Shpitser, Ilya, and Tyler J. VanderWeele. 2011. “A complete graphical criterion for the adjustment formula in mediation analysis.” The International Journal of Biostatistics 7(1): 1–24.
Tchetgen Tchetgen, Eric J. and Ilya Shpitser. 2012. “Semiparametric theory for causal mediation analysis: efficiency bounds, multiple robustness, and sensitivity analysis.” Annals of Statistics 40(3): 1816.


TenHave, T. R. and M. M. Joffe. 2012. “A review of causal estimation of effects in mediation analysis.” Statistical Methods in Medical Research 21: 77–107.
VanderWeele, Tyler. 2015. Explanation in Causal Inference: Methods for Mediation and Interaction. Oxford: Oxford University Press.
VanderWeele, Tyler J. 2011. “Controlled direct and mediated effects: definition, identification and bounds.” Scandinavian Journal of Statistics 38(3): 551–563.
VanderWeele, Tyler J., and Yasutaka Chiba. 2014. “Sensitivity analysis for direct and indirect effects in the presence of exposure-induced mediator–outcome confounders.” Epidemiology, Biostatistics, and Public Health 11: e9027.

VanderWeele, Tyler J., Stijn Vansteelandt, and James M. Robins. 2014. “Effect decomposition in the presence of an exposure-induced mediator-outcome confounder.” Epidemiology (Cambridge, Mass.) 25(2): 300.
Vansteelandt, Stijn, and Rhian M. Daniel. 2017. “Interventional effects for mediation analysis with multiple mediators.” Epidemiology (Cambridge, Mass.) 28(2): 258.
Yamamoto, Teppei, and David Yeager. 2019. “Causal mediation and effect modification: a unified framework.” Working paper.
Zheng, Wenjing, and Mark J. van der Laan. 2012. “Targeted maximum likelihood estimation of natural direct effects.” International Journal of Biostatistics 8(1): 1–40.

CHAPTER 15

Subgroup Analysis: Pitfalls, Promise, and Honesty∗

Marc Ratkovic

Abstract

Experiments often focus on recovering an average effect of a treatment on an outcome. A subgroup analysis involves identifying subgroups of observations for which the treatment is particularly efficacious or deleterious. Since these subgroups are not preregistered but instead discovered from the data, significant inferential issues emerge. We discuss methods for conducting honest inference on subgroups, meaning generating valid p-values and confidence intervals that account for the fact that the subgroups were not specified a priori. Central to this approach is the split-sample strategy, where half of the data are used to identify effects and the other half are used to test them. After an intuitive and formal discussion of these issues, we provide simulation evidence and two examples illustrating these concepts in practice.

* I thank James N. Druckman, Donald P. Green, Adam Glynn, Zenobia Chan, and Stephanie Zonszein for feedback and comments.

15.1 Introduction

In an experimental analysis, randomized assignment of a treatment variable allows for unbiased estimation of an average causal effect of the treatment. The average effects of interest are specified in advance by the researcher, and standard inferential tools allow estimation and testing of these effects.

Every average effect, though, is itself a composite of lower-level subgroup effects. A subgrouping is a partitioning of the sample into mutually exclusive subsets, normally split on observed covariates (e.g., Berry 1990). A subgroup is one of these homogeneous subsets, such as females, residents of a particular town, or White voters under 30, and a subgroup effect is the average treatment effect for this subgroup. While randomization allows us to safely average over subgroups, a subgroup analysis turns this problem on its head: Given


data, how can we identify subgroups of the data where the treatment was most or least efficacious (e.g., Assmann et al. 2000; Lagakos 2006; Rothwell 2005)? Subgroup effect estimation is crucial in at least four different settings. First, it can help identify the most impactful treatment from a set of possible treatments. In the face of increasingly complex designs, such as the conjoint analysis (Hainmueller et al. 2014), the most effective treatment may involve a combination of two or three possible treatment conditions. Second, it can help characterize an ideal treatment for a given observation, which is of particular import when making policy prescriptions (Murphy 2003). Third, a subgroup analysis can help with recent concerns over replicability. Average estimated effects may fluctuate from one sample to the next, but this fluctuation may be attributable to the fact that the samples have different distributions of underlying subgroups. Fourth, the analysis can help guide the researcher in designing the next experiment. As experimental analyses are part of a slow, careful accumulation of causal results (e.g., Samii 2016), a subgroup analysis on a given experiment can help illuminate a likely mechanism and encourage future studies to focus on where an effect is most likely to be realized. Despite this promise, the subgroup analysis raises two interrelated problems endemic to its design. First, a subgroup may be of theoretical interest and be included in a preregistration plan prior to the experiment being conducted and then tested as normal. When we say a subgroup analysis, though, we refer explicitly to an analysis where the goal is to identify effect heterogeneity among subgroups that were not specified prior to administering the experiment. The ex post nature of the analysis makes inference challenging. Simply reporting p-values from tests on subsets of the data that are not preregistered is among the worst forms of data-dredging (e.g., Simmons et al. 2011). Second, the number of possible subgroups grows exponentially with the number of covariates. By the time we include a modest number of covariates, the number of potential

subgroups can grow to the hundreds or thousands. A wide range of off-the-shelf machine learning methods can be used to model heterogeneities in data (e.g., Beck et al. 2000; Green and Kern 2012; Grimmer et al. 2017; Hill and Jones 2014; LeBlanc and Kooperberg 2010; Montgomery and Olivella 2018). As we discuss below, these methods suffer in two regards. First, these off-the-shelf machine learning methods are tailored for optimal prediction rather than optimal subgroup estimation. The difference between the two is that the best predictive values are driven by confounders, like partisanship or previous behavior, and most of the information in the data goes toward learning these confounding effects rather than the treatment effects, which are often an order of magnitude smaller. Optimal subgroup estimation instead requires focusing on variables that drive the treatment effect, not those that drive the outcome. Second, these methods do not allow for valid inference on subgroup effects. Learning subgroup effects and conducting inference on them is too much to ask of any single data set, regardless of the statistical method being used. Without reliable inferential claims, subgroup analysis devolves into fishing. We introduce a set of practices designed to allow for optimal estimation of subgroup effects, as well as for valid inferential claims to be made about these estimated effects. In this work, we provide an overview of subgroup analysis aimed at the practical user. The discussion divides into three parts. The first details how a subgroup analysis fits into the design and preregistration of an experiment. We show how to conduct an honest subgroup analysis, where the word “honest” takes on both technical and colloquial meanings. The key, as we discuss more below, is to think of a subgroup analysis as a process of discovery rather than a test, so the process itself needs to be specified in advance even if the outcome is unknown. We focus on how to generate optimal point estimates and valid uncertainty estimates, where the latter point has remained underdeveloped in the literature. We then revisit these concepts in


a formal framework, highlighting how these ideas come into play when thinking about heterogeneity and inference. In this section, we discuss several methods that can be utilized for a subgroup analysis. The third section contains an illustrative simulation showing how the split-sample approach reduces bias in subgroup estimation. The section also includes two worked-through examples, the first with a single binary treatment and the second illustrating how to conduct a subgroup analysis with multiple treatments and levels from a conjoint experiment. The conclusion discusses avenues of future work.

15.2 Design, Preregistration, and Honesty

The “replication crisis” that originated in several of our cognate fields has spurred a reconsideration of our experimental procedures (Gelman and Loken 2014 provide a nice overview). At root, this crisis stems from the divergence between the theoretical guarantees of our means of inference and actual practice. Adjusting how we run experiments, including now-standard practices such as preregistration and a preanalysis plan, guards against the worst threats to the validity of our hypothesis testing. The focus of this chapter is on extending these same ideas to a subgroup analysis. We will describe a way of conducting honest subgroup analysis. We mean the word “honest” in two senses: first, formally, in that the procedures we discuss achieve a theoretically guaranteed error rate; and second, colloquially, in that an honest procedure creates statistical guardrails against misleading or deceptive inferences. The focus of this chapter, then, is not just on estimating subgroup effects, but also on conducting inference on estimated or recovered effects. We first describe honesty in the context of testing an ex ante specified single null hypothesis, illustrating notions of validity, controlling an error rate, and honesty. In this section, we characterize in the abstract what makes a testing procedure valid.


We then move on to generalize these concepts to a subgroup analysis.

15.2.1 Honest and Efficient Inference on a Single Hypothesis

We focus on three separate concepts in thinking about inference on a single, prespecified hypothesis: validity, honesty, and power. A test statistic for the hypothesis is valid if the false-positive rate can be controlled by the researcher. A false positive occurs when a statistically significant result is observed even though the null hypothesis, normally of no causal effect, is true. We say that the false-positive rate on a test is controlled at rate α if the researcher can guarantee that the proportion of statistically significant results that would be observed under the null hypothesis is no more than α. Controlling the false-positive rate is a first-order concern in experimental studies, as statistical significance serves as a crucial and necessary step in establishing that an estimated effect reflects a systematic relationship in the data. A testing procedure is honest if it results in a valid test statistic. For example, preregistering a design and hypothesis and then testing the hypothesis using a difference in means, as described in a preanalysis plan, results in an honest test. An honest procedure has several components, and each can be violated if care is not exercised. That these violations lead to invalid p-values is well understood (e.g., Gelman and Loken 2014; Wasserstein and Lazar 2016). Preregistering handles these threats for the case of a single hypothesis or a small set of predeclared hypotheses. First, the data-generating process must be fixed, which, in practice, requires registering not just the design, but also how variables will be coded (Gelman and Loken 2014). Second, the hypothesis must be generated independent of the data used to test it. Prespecifying a hypothesis prior to running the experiment satisfies this requirement, though we discuss additional methods for doing so below in the context of a subgroup analysis. Third, the full set of hypotheses to be tested must be specified in advance. This guards against data



dredging, and again is satisfied by requiring the researcher to hypothesize effects prior to the experiment. Lastly, a valid test statistic must be used. If several exist, a more powerful method should be selected. In the case of a single hypothesis, using a t-test will achieve this goal, but the issue of power grows more important when learning hypotheses from the data.

15.2.2 Honest and Efficient Inference with a Subgroup Analysis

An experiment can help uncover three sets of causal effects: hypotheses specified in advance; hypotheses learned from the data; and discovered effects. The three classes differ in the persuasiveness of their evidence, ranging from highest to lowest in the order presented. The first class is already addressed with current preregistration practices, so we move instead to the next two classes, which are the focus of subgroup analysis. The second class consists of hypotheses learned from and tested on the data. In order to return honest p-values, the procedure must maintain two of the attributes given above. The procedure must not identify effects and test them on the same data. Recent work has advocated “sample splitting” as a central feature of maintaining honesty (Athey and Imbens 2016; Chernozhukov et al. 2018; Wager and Athey 2018). In this framework, half the data are used to identify promising subgroups and the other half of the data are used to test them. Were the same data to both generate hypotheses and test them, the p-values would not be valid; sample splitting serves a crucial role in maintaining honesty (see, e.g., van der Vaart 1998, ch. 25). Concerns over power emerge when trying to estimate subgroup effects. While an average effect may be estimated from all of the data, each subgroup is estimated over a smaller subset. Estimating relevant subgroups at this stage confronts an additional and subtler issue: the subgroup effects we are interested in have an impact that is an order of magnitude less than the effect of the confounders. This has substantial implications on the estimation

strategy. Most off-the-shelf machine learning methods try to predict the observed outcome as accurately as possible; this is a distinctly different concept from trying to estimate a causal effect. In many settings, the most important predictive variables are those that are the best known and least interesting. For intuition, consider the problem of predicting whether an individual exposed to a treatment condition votes. The strongest treatment effect may come from engaging in meaningful conversation with a canvasser, but this effect is an order of magnitude less than whether the respondent voted in the last election. A method tuned for prediction will spend quite a bit of power in the data learning the relationship between past and future voting, while a method sensitive to heterogeneity will ignore the past voting variable and focus primarily on variables that involve the treatment (Imai and Ratkovic 2013). The predictive models are distracted by these known but strong confounders. Instead, the estimation strategy needs to target causal heterogeneities and avoid these known effects. Doing so involves rethinking standard estimation strategies, and we return to how to accomplish this below. Inference on subgroup effects at this stage comes with two important caveats. First, since the data are split in half, so is the power of this method. This only seems fair, though, since we are asking two things of the data: identifying a subset of subgroups and then testing them. The second caveat is that both splits of the data come from the same single experiment. The subgroups identified, then, may be the result of a peculiarity of this particular experiment, which serves as the key distinction between the first two classes of hypotheses. A crucial question for the second class of hypotheses is how to identify them from the data. We discuss two different sets of methods below. The first set identifies subgroups in one split and tests them in a second. The confidence intervals and p-values can be read off the test split. The second set returns a fitted model of treatment effects and must be explored by the researcher. Looking at


plots of the estimated effects variable by variable is itself a form of exploration that must be accounted for; we recommend preregistering the plots (say, all one-way or two-way effects across variables) and taking a Bonferroni-adjusted threshold for significance. For example, a researcher interested in learning subgroup effects across five variables, ignoring interactions among them, should use a split-sample machine learning method to estimate confidence intervals across each variable, but use a p-value threshold of 1/5 times their allowable false-positive rate (say 0.1 or 0.05) on the subgroup effects. The final set of subgroup effects are those for which we make inferential claims about the proportion of false positives in the entire set rather than the false-positive rate on any particular hypothesis. Rather than control the false-positive rate – the probability that a given statistically significant effect is in truth null – we control the false-discovery rate: the proportion of discovered effects that are false. We describe a means of doing so below, but for this third class of effect, our inferential goals change. In the previous set of hypotheses, we are trying to make claims about how a treatment effect varies along a particular covariate. In this set, we are looking through a potentially massive set of hypotheses and attempting to discover plausible candidates to be tested in the next experiment. This final set of effects should be considered when the number of potential hypotheses is massive. This may occur in two settings. First, if the researcher is considering all three- or four-way interactions, which is plausible in a conjoint setting, the number of possible subgroup effects may quickly grow to the thousands or above. Second, the researcher may have a large number of covariates. For example, if the pretreatment covariates include a textual component, the term-document matrix may have thousands of unigrams or bigrams. In this case, there is little hope of testing each hypothesis, but we may be able to identify subgroups that may be worth testing in the next experiment.


15.2.3 Concrete Recommendations for Subgroup Analysis in a Preanalysis Plan

For clarity, we provide our concrete recommendations for generating an honest subgroup analysis below:

(1) Preregister the covariates, along with the level of interaction, that are going to define the subgroups targeted for inference.
(2) Preregister the method and, if applicable, the random seed with which it will be initialized. Utilize a method that estimates subgroup effects directly to maximize power, and implement sample splitting in this step to maintain honesty of inference.
(3) Preregister the marginal plots that will be used to look for heterogeneity. Every plot should be thought of as a degree of freedom, such that the p-value threshold should be Bonferroni adjusted. These uncovered effects can be discussed using the language of significance, but with the understanding that the process is conditional on the particular implementation of the experiment.
(4) In order to define discovered effects, preregister an acceptable false-discovery rate and implement a method that achieves this rate. No inferential claims about statistical significance can be made about these, but interesting discoveries should be noted for future experimentation. (A schematic of these four steps in code appears below.)

With this high-level discussion out of the way, we turn to a formal discussion of the problem and then discuss several options for implementing a subgroup analysis.
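What follows is a minimal sketch, in R, of how the four recommendations above might be written down in advance of an experiment. The data frame, outcome, treatment, and moderator names are hypothetical placeholders, not taken from any study discussed in this chapter.

set.seed(2138)                                    # (2) preregister the random seed
dat <- data.frame(y   = rnorm(1000),              # placeholder outcome
                  z   = rbinom(1000, 1, 0.5),     # placeholder binary treatment
                  age = sample(18:80, 1000, TRUE),
                  pid = sample(c("D", "R", "I"), 1000, TRUE))

moderators <- c("age", "pid")                     # (1) covariates defining the subgroups
alpha      <- 0.10                                # allowable false-positive rate
alpha_bonf <- alpha / length(moderators)          # (3) Bonferroni-adjusted threshold
fdr_target <- 0.10                                # (4) preregistered false-discovery rate

## (2) split-sample indicator: one half for discovery, one half for honest testing
dat$split <- sample(rep(c("discover", "test"), length.out = nrow(dat)))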

15.3 The Formal Framework

Treatment effects and subgroup effects are best characterized in the potential outcomes framework (Holland 1986; Imbens and Rubin 2015). We consider observations from a random sample i ∈ {1, 2, . . . , N}. We observe for each observation an outcome Yi



and a vector of covariates Xi, with X an arbitrary covariate profile. We will denote as zi ∈ {0, 1, 2, . . . , K} the random variable taking a value in one of K + 1 different treatment conditions, Zi as its observed value, and Z as an arbitrary value that can be taken by the treatment. In the case of a single binary treatment, zi ∈ {0, 1}. In a more complex setting, say with one treatment with three levels and another with four levels, K = 3 × 4 − 1 = 11, the total number of treatment conditions beyond the control condition. The level zi = 0 is reserved for the baseline level for each treatment. In this setting, each observation i has a potential outcome function yi(Zi) = Yi that maps each treatment level to an observed outcome.1 The treatment effect of condition Z ∈ {0, 1, 2, . . . , K} for observation i is denoted τi(Z) = yi(Z) − yi(0), with average effect

τ(Z) = E[yi(Z) − yi(0)].    (15.1)

If the treatment is randomized, a difference-in-means estimate is unbiased for τ(Z). We are also interested in the subgroup effect for observations with some covariate profile of interest X. We write this conditional average treatment effect (CATE) as

τ(Z; X) = E[yi(Z) − yi(0) | Xi = X],

which is the treatment effect of condition Z for observations with covariate profile X. We next turn to two central issues: how to estimate these heterogeneities effectively and how to conduct inference on them honestly.

15.3.1 Maximizing Power in Estimating Subgroup Effects

Uniformly powerful tests are simple when estimating average treatment effects, as

1 We make the standard assumptions that each treatment has only one version, there is noninterference among units, and every treatment condition is realized with positive probability.

t-tests or least squares models can return unbiased and efficient estimates. Power becomes more important in estimating subgroup effects. First, of course, each subgroup is characterized by a fraction of the data. Second, we need to differentiate between the best predictive model and the model best suited to uncover subgroup effects. Off-the-shelf machine learning methods that find the best prediction focus on the most pronounced aspects of the data. This results in the method “learning” noncausal relationships in the data that are already known to, and often uninteresting to, the researcher. Efficient and powerful estimation of subgroup effects requires differentiating a predicted value from a subgroup effect and directly targeting the latter. To formalize, we follow Athey and Imbens (2016) and distinguish between two different types of estimation strategies. The first attempts to explain as much variance in the outcome as possible given the treatment and covariates. Taking μ(Z, X) = E(Yi | Zi = Z, Xi = X), the first minimizes

μ̂(Z, X) = argmin_μ̃ E[(Yi − μ̃(Zi, Xi))²],

which is the best predictive model. Off-the-shelf machine learning methods will attempt to minimize this predictive error. The subgroup analysis, at its most general, can be characterized as finding the combinations of potential treatments and covariate profiles that give the largest value of treatment heterogeneity. The second set of methods attempts to explain as much treatment heterogeneity as possible:

τ̂(Z, X) = argmin_τ̃ E[(τ(Zi; Xi) − τ̃(Zi, Xi))²].

We should favor methods that directly estimate τ rather than those that estimate μ and reconstruct τ. The former group is efficient, while the latter is not. The difference between the two approaches is subtle, but important. The predictive approach will attempt to learn the best model


of the outcome, which includes more than the treatment effects of interest. Consider partitioning the predictive loss function into two components: a component that varies with the treatment and one that does not. The second element, which does not vary with the treatment, is of no interest to the researcher, and any information in the data spent on this component is wasted. The treatment heterogeneity approach only considers variance in the outcome that can be explained with the treatment variable. The predictors that are of no interest are differenced out, focusing estimation on covariates that drive the treatment effect. Characterized in this way, two questions present themselves: How can we discover these heterogeneities and how can we conduct inference on our findings? We turn to each question in turn.

15.3.2 Estimating Treatment Effect Heterogeneity

We discuss two approaches that can be used to generate efficient estimates of treatment effects. The first approach was most recently advanced in a series of papers by Susan Athey and colleagues (Athey and Imbens 2016; Athey et al. forthcoming; Wager and Athey 2018). The loss function

τ̂(Z, X) = argmin_τ̃ E[(τ(Zi; Xi) − τ̃(Zi, Xi))²]

is not feasible, since we do not know the true function τ(Z; X). The authors show that the estimate

τ̂(Z, X) = argmin_τ̃ −E[τ̃(Zi, Xi)²]

will optimally recover treatment effect heterogeneity.2 The loss function offers a nice, practical interpretation: a method that attempts to explain as much treatment

2 The derivation relies crucially on having an unbiased estimate of τ(Zi, Xi), which itself requires a split-sample approach to estimation. We return to this point below, but see Athey and Imbens (2016) for a derivation.


effect heterogeneity as possible is optimal for recovering subgroup effects. The authors have then developed a suite of tree- and forest-based methods that efficiently estimate subgroup effects. A second approach to avoiding confounders involves removing their effect prior to identifying the subgroups. Recently advocated by Chernozhukov et al. (2018), the approach requires a two-step procedure. In the first step, the effect of the confounding variables is taken out of the outcome and treatment variable, generating variables

Ỹi = Yi − E(Yi | Xi);   Z̃i = Zi − E(Zi | Xi).

We refer to these as the “partialed-out” variables (Chernozhukov et al. 2018; Neyman 1979; Robinson 1988), since we have subtracted off the impact of the confounders. Importantly, any machine learning method can be used to conduct the partialing out; see Chernozhukov et al. (2018) for formal details. We then run a method for discovering subgroups on these partialed-out values. Importantly, if the treatment is properly randomized by the experimenter, such that the value of E(Zi | Xi) is known, then this value need not be estimated. In this case, a predictive model and a model incorporating inverse probability of treatment weights are asymptotically indistinguishable. If, on the other hand, there is some fear that the experiment was not perfectly executed, then the researcher may prefer to utilize a method to adjust for any bias that can be accounted for by the covariates. Both methods were designed not just with an eye to estimation, but also inference. They are both implemented using a split-sample approach in order to ensure honest inference, a point to which we turn next.

15.3.3 Split-Sample Approaches for Honest Inference

As discussed above, the goal with a subgroup analysis is not just to identify relevant subgroups, but to make inferential claims about those that are uncovered. The classes of methods described directly above were



designed to be implemented using a split sample. A split sample involves taking the data and simply splitting them into two equally sized subsets; the splits may be done completely at random or may be done with respect to any blocking in the experiment. For tree-based methods, one split of the data is used to learn the tree structure, and then the second split of the data is used to conduct inference at each terminal node; a set of these trees may be aggregated up to a forest. Similarly, using the partialing-out approach, one split of the data is used to remove the effect of the confounders and the second is used to learn subgroup effects. Previous work has used the split-sample approach to recover honest estimates of regression coefficients (Chernozhukov et al. 2018; Robinson 1988) or average treatment effects (Wager and Athey 2018). We extend this approach to show, in some generality, that the split-sample approach can be used to guarantee valid inference on a test statistic. Specifically, we describe an honest procedure based on a split sample: the first split of the data is used to select a subgroup effect to be tested, and a null hypothesis of zero effect is assumed; the second split of the data is used to test this hypothesis, using any valid test. We show that this method is honest. To formalize, we assume a set of null hypotheses, each corresponding with a potential subgroup effect upon which we may wish to conduct inference. We will denote null hypothesis h as H0^h, with h ∈ {1, 2, . . . , H}. The observed sample is denoted S. We assume a test statistic t̂h for null H0^h and a significance threshold t*h such that the researcher can control the false-positive rate on the test. For example, t̂h may be a z-statistic and t*h the familiar threshold of 1.96. An effect is significant if the test statistic is larger than the threshold. We also assume that t̂h is estimated and t*h selected to give the same false-positive rate across all hypotheses. We use the data to learn a promising subgroup, say h(S), and make a null hypothesis about it, H0^h(S), to differentiate the null hypothesis from a hypothesis made independent of the data, H0^h. Under null hypothesis H0^h, a false positive occurs if 1(|t̂h| > t*h) = 1, and the false-positive rate we wish to achieve is

E[1(|t̂h| > t*h) | H0^h] = E[ E(1(|t̂h| > t*h) | S, H0^h) ],

where the outer expectation is over repeated samples under the null hypothesis. The issue in a full-sample subgroup analysis is that we have consulted the data to learn the hypothesis H0^h(S), which renders our false-positive rate incorrect:

E[ E(1(|t̂h| > t*h) | S, H0^h) ] ≠ E[ E(1(|t̂h| > t*h) | S, H0^h(S)) ].    (15.2)

Intuitively, if we use the data to select a promising subgroup effect, then it is biased toward being significant, since we are using the same data to test the hypothesis that we used to uncover it. Instead, assume we have two equally sized splits of the data, S1 and S2, where we use S1 to learn a promising subgroup and S2 to test it. Our false-positive rate is

E[ E(1(|ẑ| > z*) | S2, H0^h(S1)) ].

By decoupling selection and testing of the hypothesis, this procedure returns a valid test:

E[ E(1(|ẑ| > z*) | S2, H0^h(S1)) ] = E[1(|t̂h| > t*h) | H0^h].

To see this, first fix S1, which fixes H0^h(S1). In this setting, any false positive is attributable to variance in S2, which returns a valid test. Since we can imagine, then, sampling over S1 repeatedly, the test is valid, since any connection between selecting the hypothesis and testing it is broken. In practice, we have shown that a valid test of a null hypothesis generated from a split sample will achieve the nominal error rate. There are, of course, several practical


issues that may interfere with this result. First, there may be something about the particular experiment that is not representative of the full population. Irregularities in administering the experiment will bias any estimate and invalidate the procedure. Second, this method falls prey to issues of multiple testing as much as any other method. The preferred approach, which we recommend, is to prespecify how many tests are anticipated and then Bonferroni-correct the p-values.

15.3.3.1 Adjusting for Multiple Hypotheses

The false-positive rate is the probability of a false positive on a single hypothesis over repeated samples. The family-wise error rate is the probability of achieving a false positive among a family of K hypotheses. Define the number of statistically significant effects from a set of K hypotheses h ∈ HK as

V = Σ_{h ∈ HK} 1(|t̂h| > t*h).

Denote H0^K as the composite null hypothesis that every hypothesis in HK is true. The family-wise error rate is

Pr(V > 0 | H0^K).

Of course, as we test more hypotheses, the probability of a false positive increases. The simplest way to adjust for this issue is with a Bonferroni correction. This is a simple but valid method for adjusting for multiple hypotheses. The correction simply involves replacing a p-value threshold of p* with p*/K, and doing so allows for control of the family-wise error rate (e.g. Esarey and Summer 2015).

15.3.4 Controlling the False-Discovery Rate

The previous discussion has focused on testing a particular hypothesis, whether it is specified in advance or learned from the data. We turn now from a hypothesis-testing framework to a hypothesis-discovery framework. The goal shifts from rejecting a particular


null hypothesis to estimating a set of possible effects, but controlling the proportion of effects that are false. Define as R the number of selected effects. The false-discovery rate is E[V/R] = E[V/R | R > 0] Pr(R > 0), where the second form of the expression excludes the case where R = 0. The standard method for controlling the false-discovery rate is the Benjamini–Hochberg procedure, which works in the following fashion. Given a large set of hypotheses and a desired false-discovery rate α, the procedure works in two steps. First, the K estimated p-values on each test are ordered from smallest to largest, as {p(1), p(2), . . . , p(K)}. Then, the selected set consists of the tests associated with p-values such that

{ k : p(k) < α k/K }.    (15.3)

This procedure guarantees a false-discovery rate on the set of discovered hypotheses of below α. The acceptable false-discovery rate should be specified in advance; 0.1 and 0.05 are standard values. The discovered effects should be considered as a group, as we cannot make claims about any one hypothesis. The goal here is to reduce thousands of possible effects to a manageable number that can be explored in the next round of experimentation. The full set of hypotheses and acceptable false-discovery rate should be specified in advance.

15.3.5 Estimation Advice

In this section, we provide advice on particular algorithms that can be used for subgroup analysis. As these methods are frequently updated, we expect that these concrete recommendations may go out of date at some point. Therefore, we strive here and above to highlight our reasoning as well as our recommendations. If the goal is using a model-based machine learning method in order to learn subgroup



effects, then the researcher should not implement methods that do not engage in sample splitting or at least make some differentiation between estimating the treatment effect and prediction. Examples include off-the-shelf predictive tools like random forests or their Bayesian variants (e.g. Hill and Jones 2014) and superlearners that average over multiple methods (e.g., Grimmer et al. 2017). While these are excellent for prediction, they are not optimized for uncovering subgroup effects, nor are they properly tuned for inference. If the experiment has a single, binary treatment, we recommend implementing the generalized random forest algorithm of Athey et al. (forthcoming). The method uses a split-sample approach, with one sample to learn a tree structure and the other for inference. This process is repeated and embedded within a random forest, such that all of the data are used at some point to either learn a heterogeneous tree structure or to estimate a treatment effect. We implement this method in an application below. The case of multiple treatments does not lend itself well to the tree and forest approach, since any number of heterogeneous effects now exist and may need estimation. Here, we suggest using some version of a sparse model combined with a split-sample approach in order to find an effect. The method would split the data and model the outcome in terms of the covariates on one half of the data. A large set of controls, interactions, and higher-order terms should be included. In the second set, the partialed-out outcome should be regressed on a large set of covariates composed of treatment × covariate interactions. We recommend a sparse model, such that it takes a large number of these interactions and returns some subset of the most relevant. The standard least absolute shrinkage and selection operator (LASSO) (e.g., Hastie et al. 2013) does not return standard errors, so we offer two alternatives. First, recent work uses the LASSO to select covariates, but then runs least squares on a subset of the selected covariates. The confidence intervals and p-values on these uncovered effects are valid (Belloni et al.

2017). A second set of methods use a Bayesian framework to recover a subset of relevant effects; see Ratkovic and Tingley (2017) for a full discussion and extensions. If you have a large number of subgroups or treatment–covariate interactions, such as tens of thousands or more, we recommend the Benjamini–Hochberg approach. We have seen recent interest in the method (e.g., White et al. 2018), and we encourage its widespread adoption in situations where a potentially vast number of subgroups exist.
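The following is a minimal base-R illustration of the split-sample workflow recommended above. It is a sketch only: it stands in for, rather than reproduces, the forest- and LASSO-based implementations cited in this section, and the data, variable names, and selection rule are all hypothetical.

set.seed(123)
n   <- 2000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$z <- rbinom(n, 1, 0.5)
dat$y <- dat$x1 + 0.5 * dat$z * (dat$x2 > 0) + rnorm(n)   # heterogeneity only in x2

## Split the sample: S1 for discovery, S2 for honest testing.
in_s1 <- sample(c(TRUE, FALSE), n, replace = TRUE)
s1 <- dat[in_s1, ]
s2 <- dat[!in_s1, ]

## Discovery on S1: partial the covariates out of the outcome, then screen
## all treatment-by-covariate interactions for promising subgroup effects.
s1$y_tilde <- resid(lm(y ~ x1 + x2 + x3, data = s1))
screen <- summary(lm(y_tilde ~ z * (x1 + x2 + x3), data = s1))$coefficients
inter  <- grep(":", rownames(screen), value = TRUE)
keep   <- inter[screen[inter, "Pr(>|t|)"] < 0.05]    # interactions carried forward

## Honest testing on S2: read off only the pre-selected interactions and
## adjust their p-values for the number of hypotheses carried forward.
fit   <- summary(lm(y ~ z * (x1 + x2 + x3), data = s2))$coefficients
pvals <- fit[keep, "Pr(>|t|)"]
p.adjust(pvals, method = "bonferroni")   # method = "BH" instead targets the false-discovery rate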

15.4 Simulation Evidence and Applied Example

We next illustrate a few of the concepts discussed above. In a simulation setting, we compare two machine learning methods that have been used for identifying effect heterogeneity: Bayesian additive regression trees (BARTs; Hill et al. 2011), which do not implement a split-sample and partialing-out approach, and causal forests (CF; Athey et al. forthcoming), which do. We find that the latter returns subgroup estimates that are notably less biased and confidence intervals that are more reliable. We then include two applications, one to an audit study (see Chapter 3 in this volume) and one to a conjoint analysis (see Chapter 2 in this volume).

15.4.1 Simulation Evidence

We illustrate the basic insight that the split-sample approach combined with partialing out can reduce bias and lead to more reliable inference. For this simulation, we generate a set of three covariates, each independent and identically standard normal, denoted {Xi1, Xi2, Xi3}. Denote Si = sign(Xi1 + Xi2), so that it takes a value of +1 if the sum is positive and −1 if the sum is negative. We use two different simulation settings that vary in how we generate the binary treatment variable Zi. In the first, we generate the treatment as a coin flip; in the second, the treatment probability is a function of Si:


Setting 1: Zi | Si ∼ Bern(πi),  πi = 0.5
Setting 2: Zi | Si ∼ Bern(πi),  πi = 1 / (1 + exp(−Si))

In each setting, we generate the outcome as Yi = Si + εi, where εi is itself standard normal and independent of the covariates. Note, importantly, that the causal effect for each observation is zero: Zi does not enter the outcome. By this means, any effect we find as significant is a false positive. We compare the performance of these methods on the estimation and inference of six different subgroups, given by {Xi1 > 0, Xi1 ≤ 0, Xi2 > 0, Xi2 ≤ 0, Xi3 > 0, Xi3 ≤ 0}. We implement two methods: the CF of Athey et al. (forthcoming) and BARTs, which have been used in the past for subgroup analysis and causal estimation (e.g., Green and Kern 2012; Hill et al. 2011, with and without sample splitting, respectively). The methods are similar in that they both rely on a collection of trees to model the outcome (see, e.g., Hill and Jones 2014; Montgomery and Olivella 2018) and are designed to identify discontinuities in the data like those induced by Si. The methods differ fundamentally in how they model the treatment effect: CF utilizes a split-sample strategy, while BARTs use the full sample to model the outcome given the treatment and covariates. To recover the treatment effect from BARTs, we have the method predict the outcome by the treatment and covariates. We then estimate the treatment effect using these values and use the posterior to generate uncertainty. The intuitive arguments given above suggest that BARTs will exhibit more bias than CF in this second setting for two reasons: first, BARTs do not model the treatment as a function of the covariates; and, second, they use the same data to both model the outcome and


generate estimates and uncertainty intervals. CF, on the other hand, splits the data in half, using half to learn a forest structure to model the treatment effect and the other half to generate point and uncertainty estimates. This second procedure should generate estimates with less bias and better coverage, which we do indeed see. We start with results on point estimation, presented in Figure 15.1. The top graph in Figure 15.1 contains the results from Simulation 1, where the treatment is a coin flip, and the bottom graph contains the results for Simulation 2, where the treatment probability varies with Si. Both methods perform well when the treatment probability is constant across units. In the bottom graph in Figure 15.1, giving the results from Simulation 2, we see that neither method is unbiased across subgroups. CFs, though, have about half the bias of BARTs. Note that we are not advocating exclusively for the CF algorithm per se, as algorithms and methods are constantly improving, but arguing that the split-sample approach should be preferred in a subgroup analysis, as it leads to more reliable inference. Bias is not the whole story, though. We are focused on inference, asking whether we can trust our confidence intervals from a subgroup analysis. The results are shown in Table 15.1. The first two rows contain coverage results for the first simulation setting and the bottom two rows contain coverage results for the second setting. We see that both methods achieve nominal coverage when the treatment probability is homogeneous. In the second setting, we see deterioration by both methods. The 90% confidence intervals for CFs still cover the truth about 60% of the time, while the 90% posterior intervals only contain the truth about 10% of the time. Put differently, in this setting where there is no treatment effect, a 90% credible interval from BARTs that does not engage in a split-sample approach will produce a false positive about 90% of the time. For CFs, when using a confidence interval derived by a split-sample approach, the false-positive rate drops to about 40% – not perfect, but notably better.
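A sketch of this data-generating process in base R follows; the sample size here is arbitrary, as the chapter does not report one, and fitting the two methods additionally requires the BART and causal forest packages.

n <- 1000
X <- matrix(rnorm(n * 3), ncol = 3, dimnames = list(NULL, c("X1", "X2", "X3")))
S <- sign(X[, "X1"] + X[, "X2"])

Z_setting1 <- rbinom(n, 1, 0.5)                   # Setting 1: treatment as a coin flip
Z_setting2 <- rbinom(n, 1, 1 / (1 + exp(-S)))     # Setting 2: probability depends on S

## The outcome depends on S only, so every true treatment effect is zero and
## any "significant" subgroup estimate is a false positive.
Y <- S + rnorm(n)

## The six subgroups evaluated in Table 15.1.
subgroups <- list(X1_pos = X[, "X1"] > 0, X1_neg = X[, "X1"] <= 0,
                  X2_pos = X[, "X2"] > 0, X2_neg = X[, "X2"] <= 0,
                  X3_pos = X[, "X3"] > 0, X3_neg = X[, "X3"] <= 0)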


Figure 15.1 Subgroup estimates. Estimated treatment effect by subgroup for Bayesian additive regression trees (BARTs) (gray) and causal forests (white). Both methods are an average of trees, and there are only three variables, so both should perform well on these data. We see in Simulation 2 that causal forests have only about half the bias of BART.

Table 15.1 Coverage of 90% uncertainty intervals by subgroup.

                              Subgroup
                       1      2      3      4      5      6
Simulation 1
  BART               0.94   0.93   0.91   0.92   0.95   0.92
  Causal forest      0.94   0.87   0.89   0.90   0.93   0.92
Simulation 2
  BART               0.11   0.13   0.12   0.12   0.10   0.09
  Causal forest      0.59   0.63   0.61   0.61   0.63   0.55

The first two rows contain coverage results for the first simulation setting and the bottom two rows contain coverage results for the second setting. We see that both methods achieve nominal coverage when the treatment probability is homogeneous. In the second setting, we see deterioration by both methods. The 90% confidence intervals for causal forests still cover the truth about 60% of the time, while the 90% posterior intervals only contain the truth about 10% of the time.

15.4.2 Applied Example: A Field Experiment with a Single, Binary Treatment

In a recent audit study, Butler and Broockman (2011; see also Chapter 4 in this volume)

emailed US state legislators, varying whether the email was sent from a constituent with a stereotypically Black name (DeShawn) or White name (Jake); see the original paper for a full description of the design.



Table 15.2 Subgroup effects from audit experiment.

                                   Estimate     SE        z        n
Main effects
  Republican                        −0.06      0.02    −2.80    2170
  Democrat                           0.02      0.02     0.80    2689
  Black                              0.12      0.05     2.47     349
  Latino                             0.06      0.09     0.69     141
  White                             −0.04      0.02    −2.33    4269
Interaction effects: Republican ×
  White                             −0.06      0.02    −2.85    2114
Interaction effects: Democrat ×
  Black                              0.13      0.05     2.60     343
  Latino                             0.06      0.10     0.60     115
  White                             −0.01      0.02    −0.46    2155

Estimated effects by subgroup for the Butler and Broockman study. All subgroups with under 100 people were omitted. The Bonferroni-corrected critical value for the z statistic is 2.54.

The outcome we consider is whether the email received a reply, where we take DeShawn as the treatment condition and Jake as the control condition. We consider subgroups given by party of the legislator (Republican, Democrat) and the legislator's race (White, Black, Hispanic). We omit any subgroups that do not include at least 100 respondents, leaving us with the five main effects (Republican, Democrat, White, Black, Hispanic) and four interactive effects (White Republicans, White Democrats, Black Democrats, and Hispanic Democrats) for a total of nine overlapping subgroups.3 In order to adjust for making nine comparisons, we use a Bonferroni correction, so we lower our p-value threshold from 0.1 to p* = 0.1/9 = 0.011, which raises our critical value on the z-statistic on the difference in means from 1.64 to 2.54. The key attributes of this process for preregistering are listing the number of subgroups, the threshold for dropping subgroups due to sample size, and that a Bonferroni correction will be used.
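The adjusted threshold and critical value quoted above can be computed directly in R:

alpha     <- 0.10
n_tests   <- 9
alpha_adj <- alpha / n_tests       # 0.011
qnorm(1 - alpha / 2)               # unadjusted two-sided critical value, about 1.64
qnorm(1 - alpha_adj / 2)           # Bonferroni-adjusted critical value, about 2.54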

3 Since the vast majority of Republicans are White, dropping small subgroups removes Black and Hispanic Republicans from our analysis.

We estimate the causal effect for each subgroup using a CF. The results by subgroup can be found in Table 15.2. After the Bonferroni correction, we find three statistically significant effects. First, we find that Republicans are less likely to return an email from the DeShawn condition. We also find a practically identical effect for White Republicans, largely because 97.4% of Republican legislators are White. While Republicans are less likely to respond to DeShawn, we find no significant main effect for Democrats. We find, though, that Black Democratic legislators are more likely to respond to DeShawn than to Jake. Our results correspond with what was found in the original study, with two exceptions. First, we find that our Bonferroni correction eliminates one result that was found to be marginally significant (p = 0.07) in the original study: that White Democratic legislators are more likely to respond to Jake. Second, while our findings on White Republicans are the same as those reported in the original study, we find the effect attributed to minority Democrats (Blacks and Hispanics pooled) to be driven by Black legislators. The central methodological argument, though, is that p-values should be adjusted when conducting multiple tests on subgroups.
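A sketch of how such subgroup estimates might be computed, assuming the interface of the grf package (causal_forest and average_treatment_effect); the data below are simulated placeholders rather than the Butler and Broockman data.

library(grf)
set.seed(1)
n <- 2000
X <- cbind(republican = rbinom(n, 1, 0.45), black = rbinom(n, 1, 0.08))
W <- rbinom(n, 1, 0.5)                                   # 1 = DeShawn, 0 = Jake
Y <- rbinom(n, 1, 0.55 - 0.06 * W * X[, "republican"])   # placeholder reply indicator

cf <- causal_forest(X, Y, W)
## Subgroup estimate and standard error, e.g. for Republican legislators:
average_treatment_effect(cf, subset = X[, "republican"] == 1)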



15.4.3 Applied Example: A Conjoint Experiment with Multiple Treatment Conditions

We next reanalyze data from a conjoint experiment. The original analysis considered the effect of varying dimensions of an international climate agreement on respondent preferences in three countries (the UK, the USA, and Germany); see Bechtel and Scheve (2013) for a complete discussion of the design. Proposals were varied by the expected costs, how many countries participated, level of sanctions levied against violators, level of cuts required, organization monitoring compliance, and whether costs would be distributed proportional to historical or current pollution rates. Moderators collected from respondents include gender, ideology, age, nationality, and whether the respondent was likely to engage in reciprocity, as measured in a two-player public goods game after the experiment. We conduct our subgroup analysis in two stages: we use half of the respondents to learn interesting subgroups and the other half to test them. For the sake of preregistration, we would characterize the set of subgroups we are considering and register the seed we are using to split respondents, the statistical method we are using to learn them in the first split, and that we are Bonferroni-correcting for the number selected in the second split. In particular, we consider all treatment × moderator combinations, resulting in 368 possible subgroups being assessed. We then separate the observations into two equal splits. On the first split, we use the LASSO model of Belloni et al. (2017), which returns a subset of subgroups with estimated nonzero effect (for an overview of variable selection methods, see Ratkovic and Tingley 2017). We then enter the selected subgroup covariates into a regression model on the second split, returning standard errors clustered by respondent and Bonferroni-adjusting the p-values for the number of selected covariates. When constructing the covariates to measure a subgroup effect, we generate covariates that capture the causal effect of

a given variable in a given subgroup (see, e.g., Bansak 2020 for more). The covariates are constructed such that, within each subgroup, the treatment group and control group are contrasted; outside the subgroup, the covariate is set to zero. Specifically, let subgroups be denoted by indicator variables gis, s ∈ {1, 2, . . . , S}, where

gis = 1 if observation i is in subgroup s, and gis = 0 otherwise.    (15.4)

Consider the indicator variable for observation i, denoted Zik, k ∈ {1, 2, . . . , K}. Denote as Z̄ks the mean of Zik among observations in subgroup s, that is, the proportion of the subgroup receiving treatment k. Then, our covariate for estimating the effect of treatment k on subgroup s is

xisk = Zik − Z̄ks if gis = 1, and xisk = 0 if gis = 0.    (15.5)

This covariate is constructed such that the coefficient from regressing the outcome on this covariate gives the difference in means in that subgroup. It does so by zeroing out observations outside the subgroup, but creating a contrast between the treated and controls in that subgroup. Note that this covariate could be constructed instead to contrast those in treatment condition k to some other condition. This covariate treats all of the treatment levels except for k as the baseline, so coefficients on this covariate should be interpreted as the mean difference between the given treatment and the average of all other treatment levels for this variable. For example, we find a coefficient on United.States × Cost: $267 of −0.029. This should be interpreted as, among respondents in the USA (the subgroup), treaties that cost $267 per capita were less favored than the average across other possible levels by 2.9 percentage points. The results from this procedure can be found in Table 15.3. The left column contains the results from a regression on the selected covariates from the first split of the data; the right column contains the same results from



Table 15.3 Split-sample estimates from conjoint analysis.

Dependent variable: Support of climate change treaty
                                                     (1)                  (2)
Main effects
  Cost: $53                                      0.118 (0.015)        0.127 (0.015)
  Cost: $107                                     0.079 (0.009)        0.066 (0.008)
  Cost: $213                                    −0.092 (0.009)       −0.100 (0.008)
  Cost: $267                                    −0.096 (0.015)       −0.120 (0.015)
  Only rich countries pay                       −0.051 (0.008)       −0.049 (0.007)
  20 of 192 participants                        −0.010 (0.012)       −0.021 (0.012)
  160 of 192 participants                        0.020 (0.014)        0.028 (0.014)
  Sanctions: $11                                 0.026 (0.009)        0.044 (0.009)
  Sanctions: $43                                −0.022 (0.008)       −0.025 (0.008)
  Indep. Commission Monitors                     0.034 (0.007)        0.037 (0.007)
Interaction effects
  Female × Cost: $53                             0.029 (0.014)        0.050*** (0.014)
  Female × Cost: $267                           −0.047** (0.014)     −0.026 (0.013)
  Female × 80% of emissions cut                  0.018 (0.009)        0.004 (0.010)
  Female × Sanctions: $11                        0.029 (0.012)        0.016 (0.012)
  Conservative × Greenpeace Monitors            −0.079*** (0.012)    −0.074*** (0.011)
  Liberal × 160 of 192 participants              0.020 (0.013)        0.004 (0.012)
  Liberal × Your government monitors            −0.015 (0.011)       −0.001 (0.011)
  Reciprocity: high × Cost: $53                  0.021 (0.014)        0.002 (0.014)
  Reciprocity: high × Cost: $267                −0.020 (0.014)       −0.010 (0.013)
  Reciprocity: high × 160 of 192 participants    0.012 (0.013)        0.021 (0.013)
  Reciprocity: high × 20 of 192 participants    −0.056*** (0.013)    −0.040** (0.013)
  Reciprocity: high × 80% of emissions cut       0.020 (0.010)        0.033** (0.010)
  Env: low × Cost: $53                           0.039 (0.014)        0.050*** (0.014)
  Env: low × Sanctions: None                     0.051*** (0.011)     0.047*** (0.011)
  Env: low × Sanctions: $43                     −0.017 (0.013)       −0.046*** (0.013)
  Env: high × 160 of 192 participants            0.043** (0.014)      0.036 (0.014)
  Env: high × 20 of 192 participants            −0.078*** (0.013)    −0.084*** (0.014)
  Env: high × 80% of emissions cut               0.005 (0.010)        0.009 (0.010)
  Env: high × 40% of emissions cut              −0.041*** (0.008)    −0.042*** (0.008)
  Env: high × Your government monitors          −0.038** (0.012)     −0.045*** (0.012)
  United.Kingdom × Cost: $53                     0.037 (0.015)        0.005 (0.016)
  United.Kingdom × Cost: $267                   −0.071*** (0.017)    −0.042 (0.017)
  United.States × Cost: $267                    −0.064*** (0.015)    −0.029 (0.016)
  United.States × Only rich countries pay       −0.041* (0.014)      −0.033 (0.014)

Note: *p < 0.1; **p < 0.05; ***p < 0.01.

0} and the realized treatment assignment Z = (Z1, . . . , ZN) is a random vector with support Ω and Pr(Z = z) = pz. For example, with a population of size N = 10 and an experimental design that randomly assigns without replacement a proportion p = 0.2 to treatment condition zi = 1 with uniform probability, there are (N choose pN) = 45 possible treatment assignments (|Ω| = 45) and the realized treatment assignment Z has pz = 1/45. The experimental design characterizes precisely the probability distribution of the assigned treatments. In experiments, this is determined by the researcher and is therefore known. To analyze the effect of different treatment assignments, we compare the different outcomes they produce. These potential outcomes are defined for each unit i as the elements in the image of a function that maps assignment vectors to real-valued outcomes, yi : Ω → R. Particularly, yi(z) is the response of unit i to assignment z. For convenience, let z−i = (z1, . . . , zi−1, zi+1, . . . , zN) denote the (N − 1)-element vector that removes the ith element from z. Then, the potential outcome yi(z) can equivalently be expressed as yi(zi; z−i). Continuing with the example of the cash transfer program above, this quantity would be the potential consumption of household i given its assignment as a transfer recipient or non-recipient (zi) and the treatment assignment of all other households (z−i), including those inside and outside household i's village. Traditional analyses of experiments, and other chapters in this volume, assume no interference, in which case the potential


outcome yi(z) is restricted to being affected only by i's own treatment. That is, with no interference, for any two treatment assignments z and z′ for which zi remains unchanged, we have yi(zi; z−i) = yi(zi; z′−i) for all i ∈ {1, . . . , N}. When interference is present, there exist some units i ∈ U for which yi(zi; z−i) ≠ yi(zi; z′−i); that is, fixing the treatment of i while changing other units' treatment results in changes to i's outcome. Let Yi denote the observed outcome of unit i, where the observed outcome is related to the potential outcomes as Yi = yi(Z) = yi(Zi; Z−i), where Z−i denotes the vector Z net of its ith element. In the case of no interference, Yi = yi(Zi). Therefore, when interference is present, we need to account for others' treatment assignments as well.
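A toy illustration of these definitions in R (the numbers and the network here are hypothetical): unit i's potential outcome depends on the entire assignment vector z, so changing a tied unit's treatment while holding zi fixed changes yi(z).

N <- 4
A <- matrix(0, N, N); A[1, 2] <- A[2, 1] <- 1             # units 1 and 2 share a tie
y <- function(i, z) 1 + 2 * z[i] + 0.5 * sum(A[i, ] * z)  # spillover from treated ties

z  <- c(0, 0, 0, 0)
z2 <- c(0, 1, 0, 0)   # same z_1, but unit 2 is now treated
y(1, z)               # 1.0
y(1, z2)              # 1.5: interference, since y_1 changed although z_1 did not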

16.4 Arbitrary but Known Interference Networks

This section reviews estimation methods in a setting where interference occurs over a network of arbitrary structure, but this structure is known. The analysis follows Aronow and Samii (2017). We represent a unit's set of interfering units in terms of network ties. Then, depending on the network structure and the treated units' network characteristics, different treatment assignments may result in different and arbitrary, but known, patterns of interference. For example, assuming that interference happens through direct ties between units, treating any one unit in a fully connected network generates a pattern in which the treatment of that one unit interferes with the treatment of every other unit in the network. In a regular lattice, the treatment of any one treated unit interferes only with the treatment of that unit's four nearest neighbors, and in an irregular network, treatment assignments that treat units with many direct ties generate more interference than assignments that treat units with just a few ties. As in the anti-conflict social network experiment of Paluck et al. (2016), these methods require the researcher to measure the network or to have comprehensive


information about the connections between experimental units and to define precise causal effects that reflect the possible types of treatment exposures that might be induced in the experiment, which in turn requires the researcher to make specific assumptions about the extent of interference. The goal is to estimate exposure-specific causal effects – for the anti-conflict program, for example, we might estimate effects on students for whom at least one peer is a direct program participant, or for whom exactly two peers are participants, etc. Knowing the treatment assignment distribution allows one to account for potential sources of confounding that arise from heterogeneity across units in their likelihood of falling into different exposure conditions (e.g., heterogeneity in terms of students' number of connections with other students). The subsections below explain.

16.4.1 Exposure Mapping

To determine each unit's treatment exposure under a given treatment assignment, Aronow and Samii (2017) define an exposure mapping that maps the set of assignment vectors and unit-specific traits to an exposure value: f : Ω × Θ → Δ, where θi ∈ Θ quantifies relevant traits of unit i such as the number of direct ties to other units in the network and, possibly, the weights assigned to each of these ties. The set Δ contains all of the possible treatment-induced exposures that may be generated in the experiment, and its cardinality depends on the nature of the interference. For example, with no interference and a binary treatment, the exposure mapping ignores unit-specific traits, f(z, θi) = zi, producing two possible exposure values for each unit: no exposure (or control condition, zi = 0) and direct exposure (or treatment condition, zi = 1), in which case

Δ = {0, 1}. Now, consider interference that occurs through direct peer connections. Then, θi is a column vector equal to the transpose of unit i's row in a network adjacency matrix (which captures i's direct connections to other units), and the exposure mapping f(z, θi) can be simply defined to capture direct exposure to treatment – or



the effect of being assigned to treatment – and indirect exposure – or the effect of being exposed to treatment of peers.1 An example of such an exposure mapping (and by no

means the only possibility) is the following, whereby indirect exposure occurs when at least one peer is treated:

f(z, θi) =
  d11 (Direct + Indirect Exposure):   zi I(z′θi > 0) = 1,
  d10 (Isolated Direct Exposure):     zi I(z′θi = 0) = 1,
  d01 (Indirect Exposure):            (1 − zi) I(z′θi > 0) = 1,
  d00 (No Exposure):                  (1 − zi) I(z′θi = 0) = 1.
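A minimal base-R sketch of this exposure mapping, and of the generalized probabilities of exposure discussed below, follows. It is not the interference package referenced later in this section, and the adjacency matrix and design here are purely illustrative.

set.seed(1)
N <- 10
A <- matrix(rbinom(N * N, 1, 0.3), N, N)        # illustrative adjacency matrix
A <- 1 * ((A + t(A)) > 0); diag(A) <- 0         # make it symmetric, no self-ties

exposure <- function(z, A) {
  peers_treated <- as.vector(A %*% z) > 0       # z'theta_i > 0 for each unit i
  ifelse(z == 1 &  peers_treated, "d11",
  ifelse(z == 1 & !peers_treated, "d10",
  ifelse(z == 0 &  peers_treated, "d01", "d00")))
}

## One realized assignment: 2 of 10 units treated completely at random.
z <- sample(rep(c(1, 0), c(2, 8)))
exposure(z, A)

## Generalized probabilities of exposure, approximated by Monte Carlo draws
## from the known design (with only choose(10, 2) = 45 possible assignments,
## they could also be enumerated exactly).
draws  <- replicate(10000, exposure(sample(rep(c(1, 0), c(2, 8))), A))
pi_hat <- t(apply(draws, 1, function(d)
  table(factor(d, levels = c("d11", "d10", "d01", "d00"))) / length(d)))
pi_hat   # one row per unit: estimated Pr(D_i = d) for each exposure condition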

is a vector πi = (πi (d1 ), . . . , πi (dK )) , with the probability of i being subject to each of the possible exposures in {d1 , . . . , dK }. Aronow and Samii (2017) call πi i’s generalized probability of exposure. For example, the exposure mapping defined above gives rise to πi = (πi (d11 ), πi (d10 ), πi (d01 ), πi (d00 )) . We observe the unit-specific traits (θi ) necessary to define exposures for any treatment assignment vector, and the probability of each possible treatment assignment vector (pz ) is known. This allows us to compute πi (dk ) as the expected proportion of treatment assignments that induce exposure dk for unit i. When the set of possible treatment assignment vectors  is small, this can be computed exactly. When  is large, one can approximate the πi (dk ) values with arbitrary precision by taking a large number of random draws from . Aronow and Samii (2017) discuss considerations for how many draws are needed so as to keep biases small. This Monte Carlo method may in some cases require a prohibitive number of draws (e.g., if | | is large), but for some specific designs and exposure mappings it may be possible to compute the πi (dk ) values via a dynamic program, as in Ugander et al. (2013). The following toy example illustrates how to compute the exposure received by each unit and the generalized probability of exposure using the interference package for R (Zonszein et al. 2019). Suppose we have a set of N = 10 units and we randomly assign (without replacement) a proportion p = 0.2 to treatment condition zi = 1 with uniform probability. In this case, the realized treatment assignment Z shows that units 6 and 9 are directly treated.


