
Becoming a BEHAVIORAL SCIENCE RESEARCHER

Also Available

Principles and Practice of Structural Equation Modeling, Fourth Edition Rex B. Kline

Becoming a BEHAVIORAL SCIENCE RESEARCHER A Guide to Producing Research That Matters SECOND EDITION

Rex B. Kline

THE GUILFORD PRESS New York  London

For Joanna, Julia, and Luke

Copyright © 2020 The Guilford Press
A Division of Guilford Publications, Inc.
370 Seventh Avenue, Suite 1200, New York, NY 10001
www.guilford.com

All rights reserved

No part of this book may be reproduced, translated, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise, without written permission from the publisher.

Printed in the United States of America

This book is printed on acid-free paper.

Last digit is print number: 9 8 7 6 5 4 3 2 1

Library of Congress Cataloging-in-Publication Data

Names: Kline, Rex B., author.
Title: Becoming a behavioral science researcher : a guide to producing research that matters / Rex B. Kline.
Description: Second edition. | New York : Guilford Press, [2020] | Includes bibliographical references and index.
Identifiers: LCCN 2019010917 | ISBN 9781462538799 (pbk.) | ISBN 9781462541287 (hardcover)
Subjects: LCSH: Psychology—Research.
Classification: LCC BF76.5 .K54 2020 | DDC 150.72—dc23
LC record available at https://lccn.loc.gov/2019010917

Preface and Acknowledgments

There have been major developments in the behavioral sciences since the publication of the first edition of this book in 2009. To summarize:

1. The replication crisis expanded from mainly an academic debate within psychology to a much broader public issue that now involves other disciplines, too. The crisis involves the growing problem of failed reproducibility of published research in psychology and other areas.

2. Awareness of the potential for p hacking, or how basically any result can be presented as statistically significant through decisions that are not always disclosed, has increased to the point where the credibility of our research literature is cast in doubt. That p hacking is an enabler of scientific fraud also contributes to a credibility crisis.

3. The use of significance testing was banned by a major journal in social psychology, and increasing numbers of journals, especially in health-related fields, basically forbid the use of significance testing without estimating effect sizes.


4. Reforms from the open-science movement, including open access, open data, open peer review, and preregistration of analysis plans, are being implemented by more and more journals and government agencies that fund research.

5. The American Psychological Association has published revised journal article reporting standards for quantitative research and a new set of standards for qualitative research. Hundreds of other reporting standards for research in other areas have also been published by additional groups or associations. Such standards address a reporting crisis where insufficient information about the analysis is presented in too many journal articles.

Today's thesis students need to know about these five developments. This is why there are new or extensively revised chapters in this second edition that deal with research crises (Chapter 3), reporting standards (Chapter 4), open science (Chapter 5), statistics reform (Chapter 6), effect size estimation (Chapter 7), and psychometrics (Chapter 8).

Some challenges for thesis students are relatively unchanged; that is, they are still germane. To summarize:

1. Even after a few introductory courses in statistics and research design, thesis students often feel ill prepared when analyzing data from their own projects. This is because practical analysis skills are typically not emphasized in such courses, and there are often surprising gaps in knowledge of basic statistical principles among many, if not most, new graduate students. That such knowledge gaps are also apparent among even established researchers may be less well known but is just as real.

2. Many thesis students struggle with writing, in part because university students are not assigned meaningful writing tasks in too many courses at the undergraduate level. This means that students in even specialization or honors programs can enter a thesis course with essentially little or no refinement in their writing skills since high school.

3. Most PowerPoint presentations seen in classes, colloquia, or other contexts in educational or business settings are awful. This means that students see mainly bad examples of PowerPoint presentations, which makes it difficult for them to develop a sense of how to do better.




Chapter 2 deals with the trinity of research: design, measurement, and analysis. As in the previous edition, students are encouraged to (1) think about design, measurement, and analysis as parts of an integrated whole, and (2) understand that flaws in any individual part can affect the quality of results. Still emphasized in Chapter 9, about practical data analysis, is the idea of a minimally sufficient analysis, or the advice to use the simplest statistical technique that both addresses the hypotheses and is understood by the student. Readers are also informed of freely available computer tools for statistical analyses, including R, JASP, and PSPP. Chapter 10, on writing, has been updated to reflect the increasing reliance on reporting standards in the review process for manuscripts submitted to journals. Finally, there is an even stronger emphasis in Chapter 11, about presentations, on designing slides that are simple, effective, and free of the usual distracting clutter seen in so many PowerPoint presentations that annoy more than enlighten due to flawed organization, preparation, or visual aids.

Earlier drafts were much improved based on extremely helpful comments and suggestions by a total of seven reviewers. The names of six reviewers were revealed to me only after the writing was complete, and their original comments were not associated with their names. The seventh reviewer contacted me about the status of the second edition and then graciously volunteered to review chapter drafts. A big "thank you" to all reviewers:

•• Julie Combs, College of Education, Educational Leadership, Sam Houston State University
•• Anne Corinne Huggins-Manley, College of Education, School of Human Development and Organizational Studies in Education, University of Florida
•• Michael Karcher, College of Education and Human Development, Counseling, University of Texas at San Antonio
•• Craig D. Marker, College of Health Professions, Clinical Psychology, Mercer University
•• Diane Montague, School of Arts and Science, Psychology, LaSalle University
•• James Schuurmans-Stekhoven, School of Psychology, Charles Sturt University, Bathurst, New South Wales, Australia


•• Rachel A. Smith, College of the Liberal Arts, Communication Arts and Sciences, Pennsylvania State University

It was both a pleasure and an honor to work again with C. Deborah Laughton, Research Methods and Statistics Publisher at The Guilford Press. An author could not hope to have a better ally, but "guardian angel" is probably a better term here. Issues of permissions were dealt with quickly—and also with good style and wry humor—by Robert Sebastiano, Permissions Coordinator at Guilford. It was good to work with Guilford Senior Production Editor Laura Specht Patchkofsky and Art Director Paul Gordon, who designed the terrific book cover. The original manuscript for the second edition was expertly copyedited with many helpful suggestions by Betty Pessagno.

I am grateful to the many psychology honors program students in my sections of our year-long thesis course at Concordia. They have shared with me their aspirations and frustrations in becoming more skilled researchers, and their experiences provide the background for many chapters in this book. It has been rewarding for me to have played even a small role in helping them to build the foundations for later professional careers.

As always, my deepest gratitude goes to my wife, Joanna, and our children, Julia and Luke. Thank you for all your love and support during the writing of this book.

Contents

PART I.  PROMISE AND PROBLEMS

 1. Introduction  3
      Transitions  3
      Not Yet Ready for Prime Time  4
      Write and Speak  5
      What Thesis Students Say They Need  7
      Career Paths for Behavioral Scientists  7
      Plan of the Book  10
      Summary  13
      Recommended Readings  13

 2. Research Trinity: Design, Measurement, and Analysis  14
      Overview  14
      Design  15
      Measurement  23
      Analysis  25
      Randomized Designs  27
      Controlled Quasi-Experimental Designs  33
      Uncontrolled Quasi-Experimental Designs  38
      Longitudinal Designs  39
      Cross-Sectional Designs  41
      Options for Comparing Nonequivalent Groups  43
      Summary  45
      Recommended Readings  45
      Exercises  46
      Answers  47

 3. Crises  48
      Psychology at the Fore  49
      Crisis as Internal Conflict  50
      Replication Crisis  53
      Significance Testing Crisis  60
      Reporting Crisis  65
      Measurement Crisis  68
      Science 2.0  69
      Summary  70
      Recommended Readings  71

PART II.  REMEDIES

 4. Reporting Standards: Quantitative, Qualitative, and Mixed Methods Research  75
      Definitions of Reporting Standards  76
      Quantitative, Qualitative, and Mixed Methods Research  77
      APA Reporting Standards for Quantitative Research  80
      APA Reporting Standards for Qualitative and Mixed Methods Research  95
      EQUATOR Network and Other Reporting Standards  100
      Summary  101
      Recommended Readings  101
      Exercises  101
      Answers  102

 5. Open Science  103
      What Is Open Science?  104
      Open Access  106
      Open Data  109
      Open Source and Tools  112
      Transparency Guidelines and Open-Science Badges  114
      Rays of Hope with More to Do  116
      Summary  117
      Recommended Readings  117

 6. Statistics Reform  118
      Review and Query  118
      Rough Starts  119
      What Statistical Significance Really Means  120
      Big Five Misinterpretations  124
      Other Confusions  127
      Fantasy Land  129
      Other Problems with Significance Testing  130
      Big Nine Requirements  134
      Reforms of Significance Testing  136
      Best-Practice Recommendations  140
      Summary  144
      Recommended Readings  144
      Exercises  144
      Answers  145
      APPENDIX 6.A.  Significance Testing Glossary  146

 7. Effect Size  148
      Size Matters  149
      Characteristics of Effect Sizes  150
      Comparing Two Groups on Continuous Outcomes  153
      Comparing Groups on Continuous Outcomes in More Complex Designs  162
      Partial Eta-Squared  165
      Effect Sizes for Dichotomous Outcomes  170
      Extended Example  175
      Clinical Significance  178
      Say No to T-Shirt Effect Sizes  179
      Summary  181
      Recommended Readings  182
      Exercises  182
      Answers  183
      APPENDIX 7.A.  Source Tables for Two Factorial Examples  184

 8. Psychometrics  185
      Chapter Scope and Organization  186
      Classical Test Theory  187
      Measurement Overview  188
      Multiple-Item Measurement  189
      Composite Scores  192
      Reliability Coefficients  193
      Consequences of Low Reliability  196
      Factors That Affect Reliability  197
      Reliability Methods and Coefficients  198
      The Prophecy Formula Revisited  207
      Validity Methods and Coefficients  208
      Reporting Standards for Psychometrics  213
      Checklist for Evaluating Measures  213
      Translating Tests  215
      Modern Psychometric Methods  216
      Summary  219
      Recommended Readings  219
      Exercises  220
      Answers  220

PART III.  SKILLS

 9. Practical Data Analysis  223
      Vision First  223
      Parsimony Next  224
      Computer Tools  226
      Error Checking  228
      Data Screening  229
      Proofread for Data Entry Errors  231
      Assess Descriptive Statistics  233
      Check Distribution Shape  235
      Evaluate the Nature and Amount of Missing Data  241
      Inspect Bivariate Relations  245
      Appraise Quality of Test Scores  250
      Categorization Evils  252
      Summary  252
      Recommended Readings  253
      Exercises  253
      Answers  254

10. Writing  255
      Getting Started  255
      Maladaptive Beliefs  258
      Style Guides  260
      General Principles of Good Writing  261
      Principles of Good Scientific Writing  268
      Writing Sections of Quantitative Studies  270
      Plagiarism  284
      Ready for the Big Time  285
      Summary  286
      Recommended Readings  287
      Exercises  287
      Answers  288

11. Presentations  289
      80/20 Rule for Presentations  290
      Target the Audience  291
      Hone the Message  292
      ABC Style and Death by PowerPoint  295
      More PowerPoint and Other Presentation Mistakes  298
      Minimalism as an Alternative Style  301
      Handouts (Takeaways)  305
      Presenting Yourself  308
      Stage Fright  310
      Other Issues  311
      Summary  313
      Recommended Readings  315
      Exercises  315
      Answers  315
      APPENDIX 11.A.  Example Thesis Student Slideshow  316
      APPENDIX 11.B.  Example Handout  318

References  321

Author Index  343

Subject Index  350

About the Author  364

PART I

PROMISE AND PROBLEMS

CHAPTER 1

Introduction

The only limit to our realization of tomorrow will be our doubts of today. Let us move forward with strong and active faith.
—Franklin D. Roosevelt (undelivered address, April 13, 1945; quoted in Peters & Woolley, 1999–2018)

This book is for thesis students who are learning to conduct independent (but still supervised) research in the behavioral sciences, such as psychology, education, or other disciplines where empirical studies are conducted with humans or animals. Such students may be senior undergraduates who are completing an honors program or a specialization program with a thesis option, or they may be newly admitted graduate students at the master's level. I assume that readers (1) have already taken at least one introductory course in both statistics and research methods and (2) are considering careers in which the ability to understand or produce research is important. The main goal is to help thesis students develop the cognitive and applied skills to eventually become capable and effective researchers in their own right.

Transitions

Thesis students must navigate a few challenges as they complete their undergraduate-level studies or enter graduate school (Mumby, 2011; Pearson, 2011). They must transition from


1. Attending large classes in which students can be relatively anonymous and have little direct contact with professors to being a highly visible member in a smaller research group with closer relationships to faculty members and student peers.

2. Being evaluated by individual instructors on mainly in-class examinations that involve little or no writing (e.g., multiple-choice tests) to being evaluated by a research committee based on extensive writing (i.e., your thesis) and oral presentations, too.

3. Reading or listening about research produced by others with some understanding to doing so with even stronger comprehension, or conducting your own studies.

4. Being a relatively passive recipient of information from authority figures to someone who both takes in and disseminates knowledge through what you write and say (you become an authority figure, too).

5. Being aware of limitations with the way research is conducted or reported in the behavioral sciences to being capable of doing something about them (you learn to appreciate the need for reform and can act on it, too).

Not Yet Ready for Prime Time

Even after completing basic (introductory) courses in statistics and research methods, students are usually not yet ready to carry out independent research projects. This observation is something that instructors of research seminars and supervisors of thesis projects know only too well, and students often feel the same way, too. Part of the problem is that there are some critical gaps in the knowledge and skills of students who undertake thesis projects. For example, students' knowledge of basic concepts about statistics or research design is often rather poor, despite their previous coursework. Some possible reasons are outlined next:

1. Statistics and research methods are typically dealt with in separate courses and, consequently, their integration may not be emphasized. That is, analysis and design are presented outside the context of the other, but in real research projects they are integral parts of the same whole.




2. Too many statistics classes cover mainly the old statistics, or estimation based on null hypothesis significance testing, which has serious flaws. Emphasized in more modern courses are the new statistics, or estimation based on effect sizes, confidence intervals, and meta-analysis (a brief numerical sketch follows at the end of this section). The new statistics are an important part of statistics reform, in which routine significance testing is replaced by methods aimed at building a more cumulative research literature (Cumming, 2014; Kline, 2013).

3. Beginning in the 1980s, courses in measurement theory, or psychometrics, were dropped from many undergraduate and graduate programs in psychology (Aiken, West, Sechrest, & Reno, 1990). This development is unfortunate because strong knowledge of measurement is crucial for behavioral science researchers, especially if they analyze scores from psychological tests. Without a grasp of basic psychometrics, researchers may have trouble understanding whether those scores can be analyzed with reasonable precision (reliability) or support a particular interpretation (validity).

A consequence of these problems is that students often have difficulty when it comes time to analyze data in their own research projects. They may experience a lack of confidence in what they are doing, or, worse, they may wind up conducting analyses on results they do not really understand. That is, students too often carry out the analyses in a relatively blind way such that they lose sight of the connections between the hypotheses (research questions), study design and procedures, and interpretation of the results. Students also tend to become overly fixated on the analysis and thus pay less attention than they should to other issues, including those of methods and measurement. The problems just described occur more often among undergraduate students, but many junior graduate students evidence similar difficulties, too.
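To make the idea of the new statistics concrete, here is a minimal sketch of estimation for two groups: a standardized mean difference (Cohen's d) plus a 95% confidence interval for the raw mean difference. The small data vectors are invented for illustration only and are not from any study discussed in this book; Chapters 6 and 7 treat these methods properly.

```python
# Minimal "new statistics" sketch: effect size and confidence interval.
# The scores below are hypothetical and serve only to show the computations.
import numpy as np
from scipy import stats

treatment = np.array([12.1, 14.3, 11.8, 15.0, 13.6, 12.9, 14.8, 13.2])
control = np.array([10.4, 12.0, 11.1, 10.9, 12.5, 11.7, 10.2, 11.9])

n1, n2 = len(treatment), len(control)
mean_diff = treatment.mean() - control.mean()

# Pooled standard deviation and Cohen's d (standardized mean difference).
sp = np.sqrt(((n1 - 1) * treatment.var(ddof=1) +
              (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
d = mean_diff / sp

# 95% confidence interval for the raw mean difference (equal-variance t interval).
se = sp * np.sqrt(1 / n1 + 1 / n2)
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
lower, upper = mean_diff - t_crit * se, mean_diff + t_crit * se

print(f"Mean difference = {mean_diff:.2f}, Cohen's d = {d:.2f}")
print(f"95% CI for the mean difference: [{lower:.2f}, {upper:.2f}]")
```

The point of reporting results this way is that the size and precision of the effect, rather than a binary significant/nonsignificant verdict, become the primary findings.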

Write and Speak

Thesis students need to do a lot of writing, from a proposal before starting the project to the final version of the thesis. They may also be required to make presentations about their projects as part of research seminars or


thesis classes with students from other laboratories. Both of these forms of communication, written and oral, are critical skills for researchers. Many, if not most, scientists spend more time communicating about their work in the form of articles, grant applications, lectures, or as invited speakers than they do actually setting up experiments and collecting data. Indeed, communication is at the heart of science and a big part of a researcher’s everyday life (Feliú-Mójer, 2015). But thesis students are often unprepared to express themselves effectively in writing. This happens in part because few demands for writing may have been placed on them in earlier courses. For example, depending on the particular academic program and luck in course registration, it is possible to get a university degree without doing much, if any, serious writing. Thus, many students are simply unpracticed in writing before they enter a research seminar, thesis class, or graduate school. Even for students experienced in other types of writing, such as in the humanities or journalism, it is not easy to learn how to write research reports. This is because scientific writing has its own style and tenor that require extensive practice in order to master it. Students obliged to make oral presentations about their thesis projects are often given little guidance beyond specifying a time limit (e.g., 20 minutes), asking for coverage of particular content (e.g., project rationale, methods, and hypotheses), and maybe also showing them the basic workings of Microsoft PowerPoint or similar computer tools for creating and showing electronic slides. Yes, students see many PowerPoint presentations during course lectures, some of which may be experienced as pretty awful and trivial but others as more engaging and enlightening. For the reasons explained next, however, these experiences teach them little about how to prepare effective presentations. Amid hundreds of hours of lecture time, it is difficult for students to identify and articulate specific principles for making effective PowerPoint presentations based on hit-or-miss experiences as audience members. Consequently, it is not surprising that many students find oral presentations to be intimidating. They worry both about dealing with anxiety related to public speaking and about how to organize and display their content in PowerPoint. A few students eventually develop by trial and error an effective presentation style, but many others do not. As we all know, not all instructors are effective public speakers; thus, this statement is not an indictment directed specifically against students. Perhaps the period of trial-and-error learning could be reduced if students were offered more




systematic instruction in how to make effective presentations, including both what to do and what not to do as a speaker, and in PowerPoint, too.

What Thesis Students Say They Need

In my sections of our thesis course for psychology honors students, I ask them at the beginning of the semester, "Who are you, what do you want, and where are you going?" The "what do you want" part of the question concerns what they want to learn in the course besides the mechanics of submitting their theses at the end of the school year. Most students, about 75% or so, say that they want to learn how to better conduct their statistical analyses and interpret the results. About the same proportion say that learning how to make effective oral presentations is a priority. A somewhat smaller proportion, but still the majority, respond that they want to learn how to write a research paper for an empirical study. So the "big three" items on the students' wish list concern the analysis and developing better communication skills (both written and oral).

Other kinds of responses are typically given by a minority of the students. These include receiving information about graduate school, how to manage the logistics of a research project, how to make effective posters for presentation in a poster session, research ethics, and technical details of American Psychological Association (APA) style. The last-named refers to specifications for formatting manuscripts according to the sixth edition of the Publication Manual of the APA (APA, 2010). I make no claim that these results are representative, but I bet that many senior undergraduate students—and junior graduate students, too—who conduct thesis projects would mention the same "big three" concerns as my students.

Career Paths for Behavioral Scientists

At first glance, it may seem that most behavioral scientists work strictly in academia—that is, as faculty members in universities. Some do, of course, but only a relatively small proportion of people with graduate degrees in psychology, education, or related areas go on to pursue academic careers. That's a good thing because it is increasingly difficult to secure a tenure-track faculty position, given the increasing numbers of graduates with doctoral degrees but shrinking numbers of such positions. Estimates vary


by discipline, but only about 20% of people with doctoral degrees in the behavioral sciences eventually get a tenure-track job, and the situation in engineering and the "hard" sciences is not all that different (Schillebeeckx & Maricque, 2013; Weissmann, 2013). Within psychology, the areas with the most favorable academic job markets include the neurosciences, quantitative, health, social–personality, and developmental psychology, while the areas with the worst prospects include industrial–organizational, clinical, and counseling psychology (Kurilla, 2015). Overall, the academic job market is highly competitive in that there are typically many more applicants than available positions, especially for tenure-track slots.

Those applicants face greater demands, too, than they did in the past. As noted by van Dalen and Klamer (2005) and others, universities now place more of a premium on research productivity and especially on the ability to secure funds from granting agencies than in the past. This emphasis works against "late bloomers" who did not discover a passion for research until later in their careers. In bygone days, some tenured professors did not really begin their academic careers until their early 40s. Starting at this age is quite rare now: the usual starting age of those with assistant-level tenure-track positions today is the late 20s or early 30s.

Given the shortfall of tenure-track faculty positions relative to the number of people who graduate with doctoral degrees, many qualified young scholars are forced to either abandon their aspiration for a full-time faculty position or accept work as an adjunct or part-time faculty member. Such positions allow one to remain in an academic setting, but they may offer high teaching loads with little time for research, relatively low pay, few or no benefits, and little if any job security (Kurilla, 2015). In many undergraduate programs, part-time instructors teach most of the classes, thus creating a new faculty majority that is quite separate and unequal relative to tenure-track faculty. About 30% of part-time faculty seek outside employment, such as teaching in multiple departments or schools, in order to make a living (Griffey, 2016). This situation has produced a kind of academic apartheid that is not likely to change in the near future because it saves universities money; that is, it costs less to pay part-time faculty than full-time, tenured faculty.

It is a reality that one needs to plan for an academic position early in graduate school by (1) seeking out a supervisor who is a prolific researcher, (2) participating in research above and beyond one's particular thesis project, (3) presenting papers or posters at scientific conferences, and




(4) publishing research articles while still a student, not just after graduation. It also does not hurt to pick up some teaching experience while in graduate school, but not at the expense of getting your research done. Academia is a tough business, but it is better to consider a faculty position with your eyes wide open. But the potential rewards are great for those who believe they will thrive in academia, especially for energetic, creative, and self-­motivated people who love ideas and question conventional wisdom. Besides universities, behavioral scientists work in a wide range of governmental agencies or ministries, including those involved in health, education, transportation, engineering, criminal justice, statistics and standards, finance, and social services. Others work for nongovernmental organizations, such as those involved in human service delivery or public policy, or work in the private sector, including hospitals, marketing research firms, pharmaceutical companies, software development groups, manufacturing facilities, financial service organizations, and insurance companies. Some work as consultants, either as freelancers or as members of consultancy firms. The main clients of such firms are governments and businesses. Research training leaves graduates of behavioral science programs with marketable skills in a variety of careers outside universities. And, of course, work directly related to research is only part of what behavioral scientists do in these positions. Such responsibilities could involve actually carrying out research projects from start to finish. If so, then skills other than those directly related to design, analysis, and measurement are needed, including the ability to convey study rationale to nonresearchers (i.e., write a proposal for those who control project funds) and to work out project budget and personnel needs. University faculty members deal with the same issues whenever they write grant proposals. Another possibility includes working to evaluate research results generated by others but then conveying your recommendations, possibly to colleagues with no formal training in research but who count on your judgment. So, once again, the ability to communicate research findings in terms that are meaningful to nonresearchers or multidisciplinary audiences is crucial, both inside and outside universities. It helps that you really understand what your own results mean; otherwise, how can you explain them to others if you cannot first do so to yourself? This is why this book places so much emphasis on correct interpretations of statistical results and on statistics reform, too.


The ability to think critically about how evidence is collected and evaluated is especially important for those who work in human service fields, such as mental health, where there are unsubstantiated beliefs about associations between variables or the effectiveness of certain types of practices (Dawes, 1994):

1. Even well-intentioned efforts at intervention can produce unexpected negative consequences later on—for example, the history of medicine has many instances where some treatment is later found to do more harm than good. A skeptical attitude about a proposed treatment may help to prevent such problems.

2. An empirically based, "show-me" perspective may also constrain otherwise less cautious practitioners from making extreme claims without evidence. It may also prevent fads from dominating professional practice.

3. It is relatively easy for professionals to believe, based on their experience, that they have special insight about the causes and mitigation of human problems. Such beliefs may be incorrect, however, and it could take longer to make discoveries if one does not value the role of evidence.

4. There is growing appreciation for the need to base practice on evidence-based techniques in medicine, professional psychology, and education (APA Presidential Task Force on Evidence-Based Practice, 2006; Shernoff, Bearman, & Kratochwill, 2017). Strong research skills are obviously relevant here.

Plan of the Book

The organization of this book and the contents of its three parts are intended to address the issues just outlined about preparing thesis students for research-based careers. We begin in the next chapter with a review of fundamental principles that integrate research design with measurement and statistical analysis. Also elaborated is the association of each of the three areas just mentioned with a particular type of validity concerning the accuracy of inferences. Considered in Chapter 3 are various problems and crises that beset the psychology research literature, including:




1. The aforementioned measurement crisis, including the widespread failure of researchers to estimate and report the reliabilities of scores analyzed.

2. The reporting crisis, or the realization that critical information supporting replication, results synthesis (i.e., meta-analysis), and scientific transparency is omitted in too many journal articles.

3. The replication crisis, or the apparent inability to replicate findings from many classical studies and the dearth of studies in the literature explicitly devoted to replication.

4. The significance testing crisis, or the ongoing controversy, now occurring in many disciplines, about the proper role of statistical significance testing, if any, in data analysis.

The five chapters of Part II (Chapters 4–8) concern potential remedies for the various crises just described. Specifically, Chapter 4 deals with revised or new journal article reporting standards by the APA for quantitative research, qualitative research, and mixed methods research. The basic aim of such standards is to improve the quality, trustworthiness, and transparency of reporting on results from empirical studies. The difference between quantitative research and qualitative or mixed methods research is explained in this chapter, with greater emphasis on quantitative research. This is because most student research projects in the behavioral sciences are of the quantitative type, but this has been changing in some disciplines, such as education, where increasing numbers of students are using qualitative or mixed methods. Readers with little or no backgrounds in qualitative methods are thus introduced to them in Chapter 4.

Enhanced transparency of scientific research is part of the open-science movement, described in Chapter 5. Also emphasized in open science is making research data and related resources, including scientific articles, more accessible to both professional and public audiences. The practice of open science may also reduce the likelihood of scientific fraud, an obnoxious reality in basically all research areas.1

1 https://www.the-scientist.com/tag/research-integrity/

Chapters 6 and 7 deal with aspects of statistics reform, including the controversy about significance testing and suggested alternatives and the importance of routinely describing the magnitudes of results, or effect


size. Many journals, especially those in health-related areas of research, and some journal article reporting standards now require the reporting of effect sizes when it is possible to do so (and it usually is). The basics of psychometrics are covered in Chapter 8, with emphasis on how to assess the precision or interpretation of scores from psychological tests and also on what to report in theses or journal articles about psychometrics. For too many thesis students, the material covered in this chapter may be the only substantial presentation about measurement theory they have encountered so far. Accordingly, the main goal of this chapter is to help you to make better choices about measurement in your own project, given this reality.

Part III is devoted to skills, including data analysis and communication. As mentioned, theoretical issues are covered in many introductory statistics courses, but relatively little is said about how to manage a real analysis. This is why Chapter 9 deals with many practical issues in data analysis, such as the need to develop a clear analysis plan in which the simplest statistical technique that will get the job done is applied. That is, students are encouraged to resist the temptation to conduct too many analyses or analyses that are unnecessarily complicated, and thus not understood. There is also discussion on data screening, or how to prepare the data for analysis by checking for problems that, if undetected, could invalidate the interpretation of any results based on those data. Proper data screening is too often neglected by even established researchers, and many reporting standards call for complete disclosure of alterations made to the data that could have affected the results.

How to write a manuscript-length summary of an empirical study (including your thesis) is the subject of Chapter 10. Also discussed in this chapter are principles of good writing in general and more specific requirements for good scientific writing. Examples of common writing mistakes to avoid are offered. Considered in Chapter 11 are suggestions for making effective oral presentations while using PowerPoint or other computer tools for showing digital slides. How to plan and organize the presentation is discussed, and how to avoid mistakes in many, if not most, PowerPoint presentations is reviewed. Examples of more effective visual styles for slides are offered. How to deal with "stage fright," or nervousness about public speaking, is also reviewed.

Exercises with answers are presented in Chapters 2, 4, and 6–11 that involve analysis, measurement, or communication. Exercises that concern statistics or measurement have suggested answers, but you should first




try to work out the solution before consulting the answers. Exercises for Chapters 4, 10, and 11 about, respectively, reporting, writing, and presentations concern your particular thesis project. These exercises are intended to assist you to write a proposal or make an oral presentation with slides about your research.

Summary

The fact that many students who are about to conduct supervised research projects are not yet ready in terms of their conceptual knowledge and practical skills was discussed in this chapter. Specifically, thesis students need help with (1) developing a more complete sense of how design, analysis, and measurement complement one another; (2) conducting their statistical analysis and correctly interpreting the results; and (3) communicating to others in written and spoken form about their findings. It was also noted that there are many career tracks for those who become behavioral scientists. Some of these paths involve working in academia, but many others do not; indeed, the range of employment prospects outside universities is wide and includes governmental, commercial, educational, and other kinds of settings. Do you want to see if one of these paths might be in your future? Then let us begin by getting you ready. We do so in the next chapter with a review of essential concepts about research design, measurement, and analysis.

RECOMMENDED READINGS

Kuther and Morgan (2012) and Sternberg (2017) described various careers for students in psychology, while Carlson and Carlson (2016) did so for education majors. Vick, Furlong, and Lurie (2016) offered helpful suggestions for conducting an academic job search.

Carlson, J., & Carlson, R. (2016). 101 careers in education. New York: Springer.
Kuther, T., & Morgan, R. (2012). Careers in psychology: Opportunities in a changing world (4th ed.). Belmont, CA: Wadsworth.
Sternberg, R. J. (2017). Career paths in psychology: Where your degree can take you (3rd ed.). Washington, DC: American Psychological Association.
Vick, J. M., Furlong, J. S., & Lurie, R. (2016). The academic job search handbook (5th ed.). Philadelphia: University of Pennsylvania Press.

CHAPTER 2

Research Trinity
DESIGN, MEASUREMENT, AND ANALYSIS

The details are not the details, the details make the product.
—Charles and Bernice "Ray" Eames (quoted in Demetrios, 2013, p. 83)

The aim of this chapter is to promote better understanding of the connections among the trinity of research: design, measurement, and analysis—that is, the details of a scientific study. Emphasized next is the integration of these elements, including (1) how they combine and complement one another to form the logical basis of empirical studies and (2) how each is concerned with a particular type of validity regarding the accuracy of inferences. An integrative perspective contrasts with a fragmentary one where each element is taught in separate courses, such as statistics in one class, research methods in a second course, and measurement in a third. But a piecemeal approach to these topics may do little to foster a sense of how design, measurement, and analysis each gives context and meaning to the others.

Overview

Knowledge is a process of piling up facts; wisdom lies in their simplification. This adage, attributed to the physician and professor Martin Henry Fischer (1879–1962), sets the tone here. You already know lots of facts about research methods and statistics—maybe even piles of them—but




perhaps less so about measurement. Let us now arrange those facts in a more cohesive way that I hope ends up closer to the ideal of wisdom. We do so first by reviewing familiar but important concepts about research design. Later presentations about measurement and analysis may be less familiar to you, but this process is part of building a deeper understanding about research.

We begin our review by defining the general concept of validity in research. Valid research is sound in that (1) it meets the requirements of scientific methods in a particular area and (2) conclusions based on the results are accurate or trustworthy. There are more specific aspects of validity—including internal, external, construct, and conclusion validity—and they all concern the accuracy of specific kinds of inferences about the results. Presented in Figure 2.1 is a schematic that represents design, measurement, and analysis along with their respective associations with validity. A theme running throughout the discussion that follows is that a flaw in any single area, such as measurement, negatively affects the integrity and meaningfulness of the results, even if the other two areas—both design and analysis in this example—are without serious flaws.

Design

Attributes of research design affect two different types of validity, internal and external (Figure 2.1), each of which is described next.

Internal Validity

Internal validity refers to the approximate truth of inferences about causal effects—that is, whether a presumed causal relation is properly demonstrated. For example, suppose that an independent variable or intervention is associated with changes on the dependent variable. The degree to which we can interpret this association as due to a causal effect is the central question of internal validity. The phrase "approximate truth" is emphasized because judgments about internal validity are usually not absolute in a particular study. Shadish, Cook, and Campbell (2002) used the term local molar causal validity when referring to internal validity. This alternative phrase underscores the points that (1) any causal conclusion may be limited to the samples, settings, treatments, and outcomes in a particular study (local) and (2) treatment programs or interventions are often complex packages of different components, all of which are simultaneously tested (molar).

Design:       Structural elements; Extraneous variables; Score independence  →  Internal validity
              Population definition; Sampling plan; Proximal similarity  →  External validity
Measurement:  Operational definitions; Score precision (reliability); Interpretation (validity)  →  Construct validity
Analysis:     Point estimation; Interval estimation; Hypothesis evaluation  →  Conclusion validity

FIGURE 2.1. Essential characteristics and the major type(s) of validity addressed by design, measurement, and analysis.

Four general conditions must be met before one can reasonably infer a cause–effect relation (Cook & Campbell, 1979; Pearl, Glymour, & Jewell, 2016, describe these requirements from a perspective based on causal diagrams):

1. Temporal precedence: The presumed cause must occur before the presumed effect.

2. Association: There is an observed association (covariation); that is, variation in the presumed cause must be related to variation in the presumed effect.

3. Isolation of the causal relation: If there are other, unmeasured causes of the same effect, it is assumed that those unmeasured causes are unrelated to the measured cause. That is, there are no other plausible alternative explanations, such as that of the effects of extraneous variables, for the association between the presumed cause and the presumed effect.




4. Replication: The requirements for temporal precedence, association, and isolation all hold over replications in representative samples.

Temporal precedence is established by measuring a putative cause before the variables presumed to reflect its effects. In treatment outcome studies, for example, the intervention begins (and perhaps ends, too) before outcome is measured, and in longitudinal studies with repeated observations of the same cases, causal effects can also be estimated over time. Another example is mediation analysis, where it is assumed that a causal variable indirectly affects an outcome through an intervening variable, or a mediator (Kline, 2016). Measurement of the presumed cause, mediator, and outcome at different points in time in the order just listed would bolster the hypothesis of mediation. But if all variables are simultaneously measured, which is true in cross-sectional studies, then it cannot be established which of the two variables, a presumed cause and its supposed effect, occurred first. This is why it is so challenging to infer causation in designs with no temporal precedence.

The condition about isolation, or the absence of plausible alternative explanations, is sometimes the most difficult. This is true because it is basically impossible in a single study to control all extraneous variables that could potentially bias estimates of causal effects. Some data analysis techniques, such as regression analysis, accommodate the inclusion of potential extraneous variables as covariates, which statistically controls for their effects, an idea that is elaborated later in this chapter (a small simulated illustration also appears at the end of this section). Finally, evidence for causation is usually established over a series of studies; that is, replication is key for internal validity, which is equally true in the behavioral sciences and the natural sciences alike.

Extraneous variables are uncontrolled variables that affect outcomes apart from measured causal variables in a particular study. There are two kinds, nuisance variables and confounding variables. Nuisance (noise) variables introduce irrelevant or error variance, which both reduces measurement precision and lowers confidence in any interpretation of the data. For example, testing grade 1 students in chilly, noisy rooms may yield imprecise or inaccurate scores on a reading skills test. The administration of a good measure, but by poorly trained examiners, could also reduce score quality. Nuisance variables are controlled by specifying proper testing environments, apparatus, procedures, measures, and examiner qualifications.


Confounding variables—also called lurking variables or just confounders—are the other kind of extraneous variable. Two variables are confounded if their effects on the same outcome cannot be distinguished. Suppose that parents are excited about the enrollment of their children in a new reading program at school. Unbeknownst to the researcher, parents respond by spending even more time reading at home. These same children read appreciably better at the end of the school program, but we do not know whether this result was due to the program or extra at-home reading. This is because the researcher did not think in advance to measure at-home reading as a potential confounder.

The strength of causal inferences is determined in part by whether the structural elements of design in a particular study adequately deal with extraneous variables. These design elements are listed next (Trochim, Donnelly, & Arora, 2016):

1. Samples (groups).

2. Conditions (e.g., treatment or control).

3. Method of assignment to groups or conditions (i.e., random or otherwise).

4. Observations (i.e., the data).

5. Time, or the schedule for measurement, including when treatment begins or ends.

The combination of the five design elements specified by the researcher sets the basic conditions for evaluating the hypotheses. This combination is often far from ideal. That is, few, if any, real-world researchers can actually measure all relevant variables in large, representative samples tested under all pertinent conditions. Instead, researchers must typically work with designs given constraints on resources (i.e., time, money). Accordingly, the challenge is to specify the best possible design, given such limitations, while respecting the hypotheses. Trochim and Land (1982) noted that a best possible design is:

1. Theory-grounded because theoretical expectations are directly represented in the design.

2. Situational in that the design reflects the actual setting(s) of the investigation.




3. Feasible in that the sequence and timing of events, such as measurement, are carefully planned.

4. Redundant because the design allows for the flexibility to deal with unexpected problems without invalidating the entire study (e.g., losing an outcome measure is tolerable).

5. Efficient in that the overall design is as simple as possible, given the goals of the study.

Design must also generally guarantee the independence of the observations (Figure 2.1), which means that the score of one case does not influence the score of another. For instance, if one student copies the work of another during a group-administered test, their scores are not independent. It is assumed in many standard statistical techniques, such as the analysis of variance (ANOVA), that the scores are independent. This assumption is critical because results of the analysis can be very inaccurate if the scores are not independent. This is especially true concerning the outcomes of significance tests, or p values. Also, there is no simple statistical fix for lack of independence. Therefore, the independence requirement is generally met through design and measurement, not analysis.
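To make the covariate idea from the discussion of extraneous variables more concrete, here is a minimal simulation in the spirit of the reading-program example. It is a sketch only: the variable names, effect sizes, and sample size are invented, and statistical control of a measured confounder is no substitute for controlling extraneous variables through design.

```python
# Hypothetical simulation: how a measured confounder (at-home reading) can be
# statistically controlled by adding it as a covariate in a regression.
# All numbers and variable names are invented for illustration.
import numpy as np

rng = np.random.default_rng(seed=1)
n = 200

program = rng.integers(0, 2, size=n)               # 1 = new reading program, 0 = not
# Parents of enrolled children also read more at home (the confounder).
home_reading = 2.0 * program + rng.normal(0.0, 1.0, size=n)
# Data-generating model: the program adds 1 point; each unit of home reading adds 3.
reading_score = 1.0 * program + 3.0 * home_reading + rng.normal(0.0, 1.0, size=n)

# Naive estimate: simple difference between the program and no-program means.
naive = reading_score[program == 1].mean() - reading_score[program == 0].mean()

# Adjusted estimate: least-squares regression with the confounder as a covariate.
X = np.column_stack([np.ones(n), program, home_reading])
coef, *_ = np.linalg.lstsq(X, reading_score, rcond=None)

print(f"Naive group difference:       {naive:.2f}   (true program effect is 1.0)")
print(f"Adjusted program coefficient: {coef[1]:.2f}")
```

The naive difference mixes the program effect with the effect of extra at-home reading, whereas the adjusted coefficient recovers something close to the simulated program effect. Of course, this works only because the confounder was measured in the first place.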

External Validity

Design also determines external validity, or whether results from a study are generalizable over variations in cases, settings, times, outcomes (measures), and treatments (Figure 2.1). Population validity is a facet of external validity. It concerns variation over people and, specifically, whether one can generalize from sample results to a defined population. Another is ecological validity, or whether the combined treatments, settings, or outcomes in a particular study approximate those of the real-life situation under investigation (Shadish et al., 2002). In a clinical trial, for example, both kinds of validity just mentioned are augmented through study of representative patients in actual clinics where treatment is to be delivered.

Suppose that results for a new treatment look promising. It is natural to wonder, would the treatment be just as effective in other samples drawn from the same population but tested in other settings or with reasonable variations in treatment delivery or outcome measures? With sufficient replication, we will eventually have the answers, but replication


takes time. Is there a way to "build in" to the original study some reassurance (but not a guarantee) that the results may generalize? Yes, it is achieved through representative sampling of persons (cases) as well as settings, treatments, and outcomes (see Figure 2.1). The sampling plan concerns (1) definition of the target population and (2) use of a method to obtain a representative sample. One such method is probability sampling, where observations are selected by a chance-based method. Different kinds of probability sampling are as follows (a brief code sketch of these schemes follows this discussion):

1. In simple random sampling, all observations in a population have an equal probability of appearing in the sample.

2. In stratified sampling, the population is divided into homogeneous, mutually exclusive groups (strata), such as neighborhoods, and then observations are randomly selected from within each stratum. Normative samples of psychological tests are often stratified based on combinations of variables such as age, gender, or other demographic characteristics.

3. In cluster sampling, the population is also divided into groups (clusters), but then only some clusters are randomly selected to represent the population. All observations within each cluster are included in the sample, but no cases from the unselected clusters are included. A benefit of cluster sampling is that costs are reduced by studying some, but not all, clusters.

Probability sampling is part of the population inference model in which population parameters are estimated with random sample statistics. Probability sampling bolsters external validity because statistics averaged over large random samples tend to closely approximate the corresponding parameter. Thus, replication is critical to estimation in this model, too. Huck (2016) described the myth that a single application of random sampling yields a sample that is a miniature version of the population with values of statistics that essentially equal those of the corresponding parameters. This myth ignores sampling error, through which results in just about any particular random sample will not exactly mirror those in the population.

Probability sampling is more a theoretical ideal than reality in the behavioral sciences; specifically, few researchers actually study random samples. This is especially true in human studies where nonprobability




sampling is dominant. There are two general types, accidental and purposive. In accidental sampling, cases are selected because they happen to be available. Such samples are also called ad hoc samples, convenience samples, or locally available samples. A group of patients in a particular clinic who volunteer as research participants is an example of an ad hoc sample. So is a sample of undergraduate psychology majors at a particular university who volunteer to take part in a study. When researchers study purely convenience samples, the design really has no sampling plan at all.

A big problem with such samples is that they may not be representative. For example, research conducted with volunteers may be subject to volunteer bias, which refers to systematic differences in motivation, conscientiousness, openness, health, or other variables between volunteers and nonvolunteers (Bogaert, 1996). The greater the degree of volunteer bias, the less generalizable are the results to those who chose not to volunteer. Volunteers can also vary appreciably over settings. For example, patients who volunteer for a clinical trial in one hospital may be sicker or healthier than patients who volunteer for the same trial but are recruited in a different hospital. One way to mitigate potential bias due to selection factors is to measure a posteriori a variety of sample characteristics and report them along with the rest of the results. This allows readers of the work to compare its sample with those of related studies in the area. In the technique of meta-analysis, whether results from a set of primary studies depend on sample characteristics, such as age or gender, is often evaluated, but sufficient numbers of studies are required before meta-analysis is possible.

In purposive sampling, the researcher intentionally selects cases from defined groups or dimensions. If so, then sampling is part of the study design, and groups or dimensions according to which cases are selected are typically linked to the research hypotheses. For example, a researcher who wishes to evaluate whether the effectiveness of a new medication differs by gender would deliberately select both men and women patients with the same illness. After collecting the data, gender would be represented as a factor in the analysis along with the distinction of medication versus placebo. Used in this way, purposive sampling may facilitate generalization of the results to both men and women. But if the men and women come from a larger convenience sample, then this potential benefit of purposive sampling may be compromised, if that larger sample is itself not representative.
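Returning to the probability sampling schemes listed earlier, the following sketch shows how each might be drawn from a hypothetical sampling frame. The frame size, strata, and cluster labels are invented for illustration; real sampling plans also involve frames, weights, and nonresponse issues that are beyond this example.

```python
# Illustrative sketch of simple random, stratified, and cluster sampling from a
# hypothetical frame of 10,000 cases. Strata and cluster labels are invented.
import numpy as np

rng = np.random.default_rng(seed=42)
N = 10_000
frame = np.arange(N)                                     # case IDs in the population
stratum = rng.choice(["urban", "suburban", "rural"], size=N, p=[0.5, 0.3, 0.2])
cluster = rng.integers(0, 100, size=N)                   # e.g., 100 schools

# 1. Simple random sampling: every case has the same chance of selection.
srs = rng.choice(frame, size=500, replace=False)

# 2. Stratified sampling: draw separately within each stratum (proportional here).
stratified = np.concatenate([
    rng.choice(frame[stratum == s],
               size=round(500 * (stratum == s).mean()),
               replace=False)
    for s in np.unique(stratum)
])

# 3. Cluster sampling: randomly select whole clusters, then keep every case in them.
chosen_clusters = rng.choice(np.unique(cluster), size=5, replace=False)
cluster_sample = frame[np.isin(cluster, chosen_clusters)]

print(len(srs), len(stratified), len(cluster_sample))
```

Note that only the first two schemes fix the sample size in advance; in cluster sampling the total depends on the sizes of the clusters that happen to be selected.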


Principle of Proximal Similarity and Implications

An even broader perspective on external validity comes from the principle of proximal similarity in which characteristics of both samples and how participants in those samples were recruited are considered (Brunner, 1987). That is, persons, settings, and procedures are all considered as factors for evaluating generalizability. For example, Bernhardt and colleagues (2015) evaluated the generalizability of results from a large, multicenter randomized clinical trial where treatment consisted of intensive physical therapy starting within 24 hours of a stroke and continued 6 days a week for 2 weeks or until discharge from the hospital. Although the treatment seemed promising, the researchers found that (1) women patients were less likely to be recruited due to greater premorbid disability and late arrival to the hospital (> 24 hours) after a stroke and (2) older patients and those with more severe strokes were also less likely to be recruited. Thus, generalizability may be limited by the patient and illness factors just mentioned.

An implication of the proximal similarity principle is that characteristics of treatment can combine with those of cases, settings, or procedures in ways that constrain external generalizability. Examples of such combinatorial validity threats are listed next (Shadish et al., 2002):

1. A treatment × unit interaction is indicated when an effect holds only for certain types of units or cases, such as when a treatment works well for women but not for men, or vice versa (a small numerical illustration appears after this list).

2. A treatment × setting interaction occurs when treatment is more effective in one setting than in another, such as private practice versus public health clinics.

3. If a treatment has different effects on different outcomes, then a treatment × outcome interaction is indicated. Suppose that a reading comprehension program has effects on decoding, or the ability to sound out words, but not on fluency, or the rate at which text is read and understood. The only way to detect this pattern is to use measures of these different outcomes. If either decoding or fluency is measured (but not both), then the pattern of differential effects just described would not be obvious. Thus, the way to avoid this particular validity threat is to measure multiple outcomes for the same intervention.

4. A treatment × treatment interaction occurs when treatment effects either (a) do not hold over variations in treatment or (b) depend on exposure to previous or ongoing treatments.

An example of the fourth threat just listed is multiple treatment interference, which occurs when the effects of a treatment under study are confounded with those of a different treatment, such as when patients are taking multiple medications in addition to an experimental drug that is the focus of a clinical trial. In this example, it may be difficult to estimate the effects of the experimental drug apart from those of other medications. There are special designs for studying the effects of multiple treatments, but they generally require that each treatment is administered separately from all others, with sufficient delay between them so that lingering effects of the first treatment do not confound the effects of the second treatment.

Measurement

The role of measurement in empirical studies is critical, no less so than that of design and analysis. It serves the essential purposes indicated in Figure 2.1, including the provision of (1) operational definitions for variables of interest and (2) methods for evaluating the precision and accuracy of interpretation—respectively, reliability and validity—of scores from measures based on those definitions. In human studies, variables of interest often correspond to hypothetical constructs that are not directly observable. An example is quality of life, which requires a definition of just what is meant by life quality and specification of methods or operations that permit quantification of the construct. The operational definition permits the generation of scores (i.e., data), which are the input for the analysis.

Those scores should be both reliable, or relatively free from measurement error, and valid, or able to be interpreted in a particular way that meets the research aim. These characteristics refer to construct validity, which falls within the domain of measurement. Analysis of scores that are neither reliable nor valid may yield no useful information. As noted by Pedhazur and Schmelkin (1991), "Unfortunately, many readers and researchers fail to recognize that no matter how profound the theoretical foundations, how sophisticated the design, and how elegant the analytic techniques, they cannot compensate for poor
measures” (p. 3). Measurement is too often neglected in behavioral science research. Chapter 8 deals with psychometrics, which are statistical measures of the properties of scores from psychological tests, including their reliability and validity. How to evaluate the psychometrics of a given test and what to consider when selecting measures among alternatives are discussed in that chapter. Treatment outcome studies often have special measurement features that guard against threats to construct validity. Reactive self-­report changes, or reactivity, refers to a type of demand characteristic where participants’ motivation to be in treatment or their guesses about the nature of the study affects their responses such that a treatment effect is imitated. Reactivity can be reduced by measuring outcome outside a laboratory setting, using unobtrusive measures, avoiding the use of pretests that give hints about expected outcome, or applying masking (blinding) procedures that try to prevent participants from learning research hypotheses (Webb, Campbell, Schwartz, & Sechrest, 1966). Use of masking to prevent examiners or raters from knowing the hypotheses may also reduce the threat of researcher expectancies, where anticipations about desirable responses are conveyed to participants, perhaps unintentionally so. This is why double-­blind procedures are routinely used in clinical trials: They control for expectancy effects on the part of both participants and researchers. In studies with a single operationalization of each construct, there is the possibility of mono-­operation bias. An example is when a single version of a treatment is studied. A particular version may include just one exemplar of a manipulation, such as when just a single dosage of a drug is studied. The use of a single outcome measure when treatment effects are complex and occur over multiple domains is another example. This is why measures of multiple outcomes are typically employed in clinical trials. Mono-­method bias can occur when different outcome measures are all based on the same measurement method, such as self-­report, or the same informant, such as parents in studies of children. There may be effects due to that particular method or informant known as common method variance that can confound the assessment of target constructs. Accordingly, it is best to use different measurement methods or informants across a set of outcome measures. Special techniques that estimate the degree of common method variance in a set of measures are described in Chapter 8, about psychometrics.
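As a rough numerical illustration of why unreliable scores undermine even an elegant analysis, consider the classical attenuation formula, which says that the correlation observed between two measures equals the correlation between the underlying constructs multiplied by the square root of the product of the two score reliabilities. The values below are assumed, and the sketch is illustrative rather than an example from the text.

```python
import math

# Assumed values chosen only for illustration.
r_true = 0.50              # correlation between the constructs themselves
rel_x, rel_y = 0.70, 0.60  # score reliabilities of the two measures

# Classical attenuation: observed r = true r * sqrt(rel_x * rel_y).
r_observed = r_true * math.sqrt(rel_x * rel_y)
print(f"Expected observed correlation: {r_observed:.2f}")  # about .32
```

Even a substantial true association can look modest when either set of scores is noisy, which is one reason Chapter 8 gives sustained attention to score reliability and validity.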




Analysis

A primary concern in data analysis is conclusion validity, which involves the use of appropriate statistical methods to estimate relations between variables of interest in order to evaluate the hypotheses (Figure 2.1). There are two basic kinds of estimation: point estimation and interval estimation. In point estimation, the value of a sample statistic is used to estimate a population parameter. For example, a sample mean, designated here as M, estimates the mean of the corresponding population, or μ. Because values of statistics are affected by sampling error, their values almost never exactly equal that of the parameter. Thus, M ≠ μ in most samples drawn from the same population. You will learn later that effect sizes, not p values from tests of statistical significance, are among the most interesting and relevant point estimates to report.

Approximation of the degree of sampling error associated with a point estimate refers to interval estimation, and it involves the construction of a confidence interval about a point estimate. A confidence interval is a range of values that might include that of the population parameter within a specified margin of error. It also expresses a range of plausible values for the parameter, given the data in a particular sample. In graphical displays, confidence intervals are often represented as error bars depicted as lines extending above and below—or to the left and right, depending on graph orientation—around a point estimate. Carl Sagan (1996) called error bars a "quiet but insistent reminder that no knowledge is complete or perfect" (pp. 27–28), a fitting description. Confidence intervals are more formally defined in Chapter 6, about statistics reform.

You may be surprised to see that conducting significance tests is not the primary goal of the analysis. This is because it is quite possible to evaluate hypotheses without conducting significance tests at all. This is done in the natural sciences all the time, and recent events, including bans on significance testing in certain journals, point to a diminishing role for classical significance testing in the behavioral sciences. In some older sources, conclusion validity is defined as whether inferences about the null hypothesis based on results of significance tests are correct, but this view is too narrow, for the reasons just stated. Thus, overreliance on significance testing is a threat to conclusion validity. More modern perspectives and alternatives in the analysis are described in later chapters.

Other threats to conclusion validity include violated assumptions of statistical techniques, such as the normality, homogeneity of variance,
or independence assumptions in the ANOVA. If distributional or other assumptions are violated, then the results, including p values in significance testing, could be very wrong.

Low statistical power also adversely affects conclusion validity in significance testing. In treatment outcome studies, power is the probability of finding a significant treatment effect when there is a real effect of treatment in the population. Power varies directly with the magnitude of the real effect and sample size. Other factors that affect power include the criterion level of statistical significance (e.g., .05 vs. .01), the directionality of the alternative hypothesis (i.e., one- or two-tailed), whether the design is between-subjects or within-subjects, the particular test statistic used, and the reliability of the scores. The following combination leads to the greatest power: a large population effect size, a large sample, the .05 level of statistical significance, a one-tailed test, a within-subjects design, a parametric test statistic (e.g., t) rather than a nonparametric test statistic (e.g., Mann–Whitney U), and highly reliable scores. We see in Chapter 6 that power is generally low in the behavioral sciences, but sometimes there is little that researchers can do to substantially increase power. This is another problem when we rely too much on significance testing.

Other threats to conclusion validity come from the failure to recognize artifacts, or results due to anomalous sampling, data, measurement, or analysis features or practices. One example is range restriction, which can happen when either sampling (case selection) or attrition (missing data) results in reduced inherent variability on X or Y relative to the population. For example, suppose there is generally a positive linear relation between weight loss and systolic blood pressure among adults. However, the observed correlation between these two variables in a sample of obese adults only may be close to zero due to range restriction in weight. The same outcome would be expected if only hypertensive adults are selected for the sample due to range restriction in systolic blood pressure; see Huck (1992) for more information about range restriction. Another artifact is regression to the mean, where scores that are extremely high or low relative to the mean will likely be closer to the mean if cases are measured a second time. Examples of how this phenomenon can mimic or disguise treatment effects are given later in this chapter.

No statistical analysis can "fix" a bad research design or poorly measured data. This means that the general expression about computer use, or "garbage in, garbage out," certainly applies to the analysis phase in research. Again, design, measurement, and analysis are components of an
integrated whole, and flaws in any individual part detract from the quality of the research and limit its potential contribution. Considered next are characteristics of particular kinds of research designs, with emphasis on potential threats to internal validity. Most examples concern the comparison of treatment and control groups, but the same basic points also apply to the comparison of other kinds of groups or conditions. Special notation for describing structural design elements is also introduced. Because there are so many different types of designs in actual studies, it is impossible to review them all. Instead, the goal is to highlight issues and problems common to many, but not all, empirical studies.
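Before turning to those designs, a brief numerical sketch may help make the ideas of point estimation, interval estimation, and statistical power from this section more concrete. The scores and settings below are hypothetical, and the scipy and statsmodels routines shown are only one convenient way to obtain such quantities.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

# Hypothetical scores for a single sample.
scores = np.array([12, 15, 9, 14, 11, 13, 16, 10, 12, 14])

m = scores.mean()                    # point estimate of the population mean
se = stats.sem(scores)               # standard error of the mean
lo, hi = stats.t.interval(0.95, len(scores) - 1, loc=m, scale=se)
print(f"M = {m:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# Plotted, the CI limits would appear as error bars around the point estimate.

# A priori power for detecting a medium standardized effect (d = .50) with
# 30 cases per group in a two-tailed, two-group comparison at alpha = .05.
power = TTestIndPower().power(effect_size=0.50, nobs1=30, ratio=1.0, alpha=0.05)
print(f"Estimated power: {power:.2f}")  # roughly .47
```

Note that the power estimate in this hypothetical scenario falls well short of the conventional .80 benchmark, which echoes the point above that power is often low in behavioral science research.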

Randomized Designs

In experimental designs, the researcher manipulates the independent variable(s) and then measures the effect(s) on the dependent variable(s). In the natural sciences, it may be possible to directly manipulate variables, such as by decreasing or increasing pressure, or the amount of force applied perpendicular to the surface of an object per unit area, using a hydraulic press. In the behavioral sciences, independent variables are typically manipulated through random assignment, also called random allocation or just randomization. In treatment outcome studies, for example, cases are randomly assigned to either a treatment group or a control group. Studies in which randomization is used in this way are called randomized experiments. In medicine, such designs are called randomized control trials or randomized clinical trials.

Randomized designs are considered a gold standard for causal inference, and here's why: Randomization ensures that each case has an equal chance of being placed in either group, treatment or control. Thus, group membership has nothing to do with characteristics of cases that may confound the treatment effect. For instance, if treated cases are less ill than control cases, any treatment effect is potentially confounded with this initial difference. Randomization tends to equate groups on all variables before treatment, whether or not these variables are directly measured (Shadish et al., 2002). In the example just mentioned, randomization would tend to evenly distribute levels of illness severity and all other individual differences across the two conditions. It is this property of randomization that strengthens internal validity.
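As a minimal sketch of what random assignment looks like in practice, the following code flips an electronic coin for each of 20 hypothetical participants; the participant labels are invented, and any statistics package or random-number routine could be used instead.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
case_ids = [f"P{i:03d}" for i in range(1, 21)]   # 20 hypothetical participants

# Simple randomization: each case has an equal chance of either condition.
assignments = rng.choice(["treatment", "control"], size=len(case_ids))

for case, group in zip(case_ids, assignments):
    print(case, group)

print("Treatment n =", np.sum(assignments == "treatment"),
      "| Control n =", np.sum(assignments == "control"))
# Group sizes need not come out equal under simple randomization; block
# randomization (described below) is one way to force nearly equal sizes.
```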


You should know that randomization equates groups in the long run (i.e., on expectations), and it works best when the overall sample size (N) is reasonably large, such as > 100 when there are two conditions (Lachin, Matts, & Wei, 1988). There is no guarantee that groups formed through randomization in a particular study will be exactly equal on all measured or unmeasured variables, especially when N is not large. But randomization will equate groups on average across independent replications of the study. Sometimes it happens that groups formed through randomization are clearly unequal on some variable before treatment is administered. The expression failure of randomization is used to describe this situation, but it is a misnomer because it assumes that random assignment should guarantee equivalent groups every time it is used.

There are actually a few different types of randomization. Some major forms are described next; see Suresh (2011) for more information:

1. Simple randomization is based on generating a single sequence of random assignments, such as by flipping a coin for each case. In large samples, simple randomization tends to yield groups of roughly equal size, but group sizes can be quite unbalanced in small samples.

2. Block randomization results in equal group sizes, or nearly so. Cases are randomly divided into blocks, such as 2–4 cases per block, and then the blocks are randomly allocated to conditions. The sample size divided by the block size should have no remainder. Given N = 50, for example, a block size of 2 is fine (50/2 = 25), but not a block size of 4 (50/4 = 12.5). A problem is that forcing equal group size can result in bias compared with simple randomization. Thus, it is critical to measure potential confounding variables, or covariates, in the application of block randomization.

3. Stratified randomization is aimed at balancing groups in terms of covariates. It generates a separate block of cases for each combination of covariates, such as all combinations of gender by marital status; then the cases within each block, all equal on the covariates, are randomly assigned to conditions.

4. Cluster randomization, also known as group randomization or place randomization, involves the random assignment of larger groups to conditions. This type of randomization is used in cluster (group, place) randomized designs in which interventions
cannot be targeted at individuals. An example is a public health advertising campaign directed toward a community through various public or social media. There are special challenges and statistical techniques in cluster randomized designs that are beyond the scope of this chapter to describe, but see Hayes and Moulton (2017) for more information.

There is a more or less standard notational set for design elements where R refers to random assignment, X represents the exposure of a group to treatment, O designates an observation or measurement, and each line of symbols corresponds to a separate group. Basic kinds of randomized designs are represented in Table 2.1 using this notation.

TABLE 2.1.  Notation for Basic Types of Randomized Designs

Type                    Representation
Simple                  R        X    O
                        R             O
Pretest–posttest        R   O1   X    O2
                        R   O1        O2
Solomon four group      R   O1   X    O2
                        R   O1        O2
                        R        X    O2
                        R             O2

Note. R, random assignment; X, treatment; O, observation (assessment).

For example, a simple randomized design has two groups (treatment, control) and posttest assessment of all cases. It can be extended by adding more treatment conditions, such as different dosages of the same drug, or more control conditions. Examples of the latter include a placebo control group that receives all the trappings of treatment except the presumed active ingredient of treatment (e.g., an inert pill or tablet in a drug study) and a waitlist control group where patients are seen for an initial assessment. No treatment is given, but they are promised treatment when they return at a later date. Johnson and Besselsen (2002) and Lindquist,
Wyman, Talley, Findorff, and Gross (2007) described additional kinds of control groups for human or animal studies.

In a randomized pretest–posttest design, a pretest (O1) is administered to all cases after randomization but before the administration of treatment.1 A posttest (outcome measure, O2) is given to all cases at the end of the study (see Table 2.1). The availability of the pretest helps to mitigate the risk of case attrition, or drop out, from the study; specifically, it can be determined whether participants who left the study early differed from those who remained. Comparison of pretest with posttest scores also permits estimation of the amount of change in each of the treatment and control groups. There are a few options for the pretest:

1. A repeated pretest is identical to the posttest. This is because all cases are tested twice, using the same measure. This type of pretest offers a relatively precise way to index group differences both before and after the administration of treatment, but pretest sensitization—defined momentarily—may be a concern with the use of a repeated pretest.

2. A proxy pretest is a variable that should predict the posttest but is not identical to it. Such a pretest can be a demographic variable or, even better, a psychological variable that is conceptually related to the outcome measured by the posttest. It can also be an archival variable that is collected after the start of the study. For example, in a study of reading outcomes among grade 2 students, a letter recognition task is administered at the beginning of the school year. Not only should the two measures (pretest, posttest) be related, but not all children can read proficiently at the beginning of grade 2.

3. A retrospective pretest is administered at the same time as the posttest, and it asks participants to describe their status before treatment. For example, at the conclusion of a work skills seminar, participants may be asked to rate what they learned (posttest) and at the same time describe their level of prior knowledge (retrospective pretest). But such ratings are subject to motivational biases, such as the desire to please the instructor or to justify time spent in the seminar. This is why Hill and Betz (2005) argued that

1 A variation is to administer the pretest before randomization. In this case, the position of the R in the design notation would accordingly vary.
retrospective pretests are better for evaluating subjective experiences of posttreatment change than for estimating actual treatment effects. Pretest sensitization or testing effects is a kind of internal validity threat where the administration of a pretest affects scores on the posttest apart from any treatment effect and is a potential drawback of the randomized pretest–­posttest design. Such effects are directly estimated in a Solomon four-group design. The basic layout consists of two treatment groups and two control groups. One treatment group and one control group are each administered the pretest but not the other two groups; see Table 2.1 for the design schematic. If, say, the two treatment groups differ appreciably at posttest, then there may be a testing effect for the treated cases. A testing effect for the untreated cases is likewise indicated if the two control groups differ appreciably at posttest. If the two testing effects just described are not similar, then the effect of giving a pretest is different for treated versus untreated cases. For example, Chang and colleagues (2014) used a Solomon four-group design in a randomized, multicenter study of online courses for pediatric emergency medicine residents and students. Pretest sensitization effects were essentially nil, and posttest knowledge scores were higher among participants enrolled in the online course. The randomized designs outlined in Table 2.1 all pose the ethical dilemma that treatment is withheld from vulnerable or ill persons in the control group. A recent example described by Ziliak and Teather-­Posadas (2016) is the denial of corrective eyewear to thousands of Chinese schoolchildren with demonstrably poor vision in a randomized trial on whether prescription eyeglasses improve school performance. There are other examples, some even horrific, including the Tuskegee Syphilis Experiment (1931–1971), in which government doctors knowingly did not treat hundreds of syphilitic African American men in order to form a control group to be compared against penicillin treatment, which was known at the time as an effective intervention for syphilis (Gray, 1998). Treatment is not withheld in the two special designs outlined next where each case serves as its own control. In a basic switching-­replications design, a pretest is administered to all cases, and then treatment is introduced in the first group only. Next, a posttest is administered to all cases, and then the second group is given the treatment. Finally, observations on a second posttest are collected from all cases. In this way, the

implementation of treatment is replicated, and the two groups switch roles—from treatment to control and vice versa—when the treatment is repeated. At the end of the study, both groups have been treated but in a different order. Because treatment is given at any one time to just one group, resources are saved compared with treating all cases at once. An exercise asks you to specify the notation for a switching-replications design. Stuss and colleagues (2007) employed such a design to evaluate a 3-month cognitive rehabilitation program within a sample of healthy adults who were 77–89 years old. Treatment-related gains were observed for both groups, and these gains persisted at a 6-month follow-up.

In a basic crossover design—also known as a switchback design or a reversal design—all cases receive two different treatments, XA and XB, but in a different sequential order. Each case is also measured three times—at pretest, a first posttest after administration of the first treatment, and then a second posttest after the second treatment. The design allows for the direct evaluation of order effects, or whether the sequence "XA then XB" predicts a different outcome than the sequence "XB then XA." Carryover effects, where the first treatment affects the response to a second treatment, are a special concern. One way to prevent carryover is to include a washout period between treatments, during which the effects of the first treatment should dissipate before beginning the second. An exercise asks you to specify the notation for this design. In a randomized crossover design, Kulczycki, Kim, Duerr, Jamieson, and Macaluso (2004) evaluated user satisfaction for male condoms versus female condoms in a sample of urban women in stable relationships. Both the women and their partners expressed greater satisfaction with male condoms, although both types had relatively low acceptability.

There are other kinds of experimental designs. One is the randomized factorial design, where cases are assigned to combinations of two or more independent variables (factors), such as two different types of drugs. Factorial designs allow estimation of interaction effects, where the influence of a particular factor on the dependent variable changes across the levels of other factors. Interaction effects are always conditional. For example, if the relative effects of drug 1 versus drug 2 are different for men versus women, there is a drug × gender interaction. This means that the strength or magnitude of the drug effect depends on gender. For the same example, it is also true that gender differences depend on which drug is administered; that is, interaction effects are always symmetrical. An exercise asks you to specify the notation for a basic factorial design.
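To connect the idea of a drug × gender interaction to an actual analysis, here is a hypothetical sketch of data from a 2 × 2 randomized factorial design analyzed with a two-way ANOVA. The cell means, sample sizes, and statsmodels calls are illustrative assumptions, not results from any study cited in this chapter.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(seed=7)
n_per_cell = 25
# Assumed cell means: drug 2 works better for men, drug 1 better for women.
cells = [("drug1", "men", 10.0), ("drug1", "women", 14.0),
         ("drug2", "men", 13.0), ("drug2", "women", 11.0)]

rows = []
for drug, gender, mu in cells:
    for y in rng.normal(mu, 3.0, n_per_cell):
        rows.append({"drug": drug, "gender": gender, "outcome": y})
data = pd.DataFrame(rows)

# Two-way ANOVA including the drug x gender interaction term.
model = smf.ols("outcome ~ C(drug) * C(gender)", data=data).fit()
print(anova_lm(model, typ=2))
# A substantial C(drug):C(gender) effect indicates that the drug effect is
# conditional on gender -- and, symmetrically, that the gender difference
# depends on which drug was administered.
```

Because the assumed cell means cross over, the drug effect reverses direction across gender, which is exactly the kind of conditional pattern the interaction term is meant to detect.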




Controlled Quasi-Experimental Designs

A controlled quasi-experimental design has separate treatment and control groups, but there is no randomization as in a true experiment. Instead, cases are assigned to conditions by some other method in which the researcher usually plays no role. Suppose that introductory statistics will be taught in two different sections. The same material is covered in both sections, and the exams are the same, too, but one section is a traditional lecture course and the other is an online version. The first 100 students who register for statistics are assigned to the online course, and the next 100 are placed in the lecture course. This method of assignment may result in groups that are different before the course even begins. For example, students keenly interested in statistics may register early, while those who detest statistics may procrastinate and register later. Thus, students in the online course may perform better simply because of their greater interest in the subject rather than the instructional format per se (i.e., volunteer bias).

In a nonequivalent-groups design, the treatment and control groups are intact, or already formed. These groups may be self-selected, as in the example just considered, or formed by another means beyond the control of the researcher. Ideally, (1) the two groups should be as similar as possible and (2) the choice of group that receives the treatment is made at random, but these preferred conditions may not exist in practice. Instead, intact groups may differ on any number of variables confounded with treatment effects, which is known as selection bias, a pervasive type of validity threat in designs with nonequivalent groups. Volunteer bias is a particular kind of selection bias. It may be possible to measure some, but not all, confounders. This is because all the ways in which nonequivalent groups differ prior to treatment that also affect the outcome variable are rarely known beforehand. This lack of information is why causal inference is generally so much more difficult in quasi-experimental designs than in randomized experiments (Shadish et al., 2002).

The risk that treatment and control groups in quasi-experimental designs are inherently unequal prior to treatment can be mitigated by (1) attempting to measure potential confounding variables and (2) controlling for effects of confounders in the analysis when estimating treatment effects. This strategy corresponds to statistical control, where pretests are used to measure possible confounding variables. Next, scores from those pretests are treated in the analysis as covariates, which are specified as
predictors of outcome along with the distinction between treatment and control (i.e., the independent variable). The estimate of the treatment effect is then statistically corrected, taking account of the covariates. Specific methods for statistical control are described later in this chapter, but it is worth saying now that there is no magical cure for the absence of random assignment in a nonequivalent-groups design. Indeed, it is very challenging to actually minimize bias when estimating treatment effects in such designs.

Notations for structural elements in basic controlled quasi-experimental designs are presented in Table 2.2, where NR means nonrandom assignment.

TABLE 2.2.  Notation for Basic Types of Controlled Quasi-Experimental Designs

Type                Representation
Posttest only       NR        X    O
                    NR             O
Pretest–posttest    NR   O1   X    O2
                    NR   O1        O2
Double pretest      NR   O1   O2   X    O3
                    NR   O1   O2        O3

Note. NR, nonrandom assignment; X, treatment; O, observation (assessment).

The absence of a pretest in the posttest-only design makes it hard to disentangle treatment effects from preexisting group differences that are related to outcome. Accordingly, the internal validity of this design is very weak, and the observed mean contrast on the outcome measure(s) could be a biased estimator of the real treatment effect. An improvement is the pretest–posttest design, where pretest data are collected either before treatment is administered (as is depicted in the table) or after treatment begins (if so, the position of O1 in the structural notation would accordingly vary). Possible types of pretests include proxy, repeated, or retrospective pretests. Of the three, a repeated pretest is best for gauging initial differences between the treatment and control groups. The internal validity of the pretest–posttest design is still subject to some critical threats, even if the pretest and posttest are identical. One
is selection-­maturation bias, which refers to the possibility that the treatment and control groups are changing naturally in different ways that mimic a treatment effect. A related concern is selection-­history bias, or the possibility that events occurring between the pretest and posttest differentially affected the treatment and control groups, again in a way that confounds the treatment effect. Selection-­regression bias happens when there are different rates of regression to the mean across the groups. Suppose that treated cases are selected because they have the highest scores on a pretest of the number of illness symptoms (i.e., they are the sickest). Because of regression to the mean, we expect the treated cases to obtain less extreme scores when tested again, even apart from any beneficial effect of treatment. Shadish and colleagues (2002) described additional types of internal validity threats in various kinds of controlled quasi-­ experimental designs. In a double-­pretest design, all cases are tested on three occasions, twice before treatment and again after (see Table 2.2 for the design schematic). This design helps to reduce the threat of maturation. With two pretests, one can compare the pretreatment growth rates of both groups before commencing any treatment. For example, presented in Figure 2.2 are means for a treatment and control group on an outcome where a higher score is a better result. Figure 2.2(a) corresponds to a pretest–­ posttest design with a single pretest. The observed means indicate that both groups improved from the pretest to posttest, but the treated cases improved more. Still, the treatment group started out at a higher level, so is the result in Figure 2.2(a) due to the treatment or to maturation? Presented in Figures 2.2(b) and 2.2(c) are hypothetical results from a double-­pretest design. The pattern of means in Figure 2.2(b) suggests the absence of a treatment effect because the steeper growth rate of the treated cases continued despite treatment. But the pattern of means in Figure 2.2(c) is more consistent with attributing at least the posttest difference to treatment because the growth rates of the two groups were similar until treatment began. The internal validity of the double-­pretest design is still susceptible to other threats though, including history and testing effects (e.g., giving a pretest affects the posttest in different ways for the two groups). Other types of controlled quasi-­experimental designs with nonequivalent groups are counterparts to randomized designs. For example, it is possible to directly estimate the effect of administering a pretest on

[Figure 2.2 appears about here: line graphs of mean outcome scores for the treatment and control groups. Panel (a) shows pretest–posttest results across Pretest and Posttest; panels (b) and (c) show two alternative double-pretest results across Pretest 1, Pretest 2, and Posttest.]

FIGURE 2.2.  Results from a pretest–posttest design (a) and two alternative results from a double-pretest design (b, c). In panels b and c, treatment is administered after the second pretest.

posttest scores in a Solomon four-group design where the groups are not formed with random assignment. The notation for the design just described would be the same as for the randomized Solomon four-group design in Table 2.1 except the code for group assignment would be “NR” (not random) instead of “R” (random). There are also quasi-­experimental versions of switching-­ replications designs and crossover designs. An example follows. Bottge, Rueda, LaRoque, Serlin, and Kwon (2007) evaluated a problem-­solving approach to teaching arithmetic to students with math-­ related learning disabilities. In a switching-­replications design, two intact classrooms, each at different schools, were assigned to one of two different instructional orders based on standard math instruction or problem-­ based instruction. Both treatment orders resulted in better math problem-­ solving skills but less so for calculation skills. Given these results, Bottge



and colleagues suggested that it may not be necessary to delay the teaching of concepts for understanding until all related procedural skills are mastered.

A regression-discontinuity design is especially strong in terms of internal validity. This is because cases are assigned to treatment versus control groups based on scores from a special kind of pretest called an assignment variable, designated next as OA. The researcher determines a threshold or cutting score, C, on OA that determines group membership. Suppose that OA is an achievement test where C = 30. Students with scores ≥ 30 are enrolled in an enrichment program (treatment), while students with scores < 30 serve as controls.

One encouraging development is the rise of large, coordinated replication projects that pool data from many independent samples (> 30) for each effect, which may boost statistical power close to nearly 1.0 when results are aggregated. See also Camerer and colleagues (2018), who found that about 70% of social science results published in the journals Nature and Science from 2010 to 2015 were reproduced in high-powered replications, with sample sizes about five times greater than in the original studies.

Another bright spot in psychology concerns psychometrics, which has a relatively strong tradition of replication (Kline, 2013). This status may be due at least in part to professional benchmarks for those who develop psychological tests, such as Standards for Educational and Psychological Testing, developed jointly by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (2014). The demonstration of score reliability and validity requires several types of evidence, including the use of multiple methods of assessment. In psychometrics, there is also appreciation of the need to cross-validate tests that generate scores based on composites, or linear combinations, of scores from predictor variables. These weights are subject to sampling error; thus, it is important to determine whether their values replicate in other samples.

Baker (2016) surveyed nearly 1,600 researchers in biology, medicine, physics, engineering, and other areas about reproducibility in science. The respondents were readers of Nature and so perhaps also were more receptive to concerns about replication. Thus, they may not be representative of all researchers in these areas, but the results shed light on the replication problem:

1. About 70% of respondents tried and failed to replicate another scientist's experiments, and about 50% failed to replicate some of their own experiments.

2. About 50% agreed that the replication crisis is "significant," about 40% described the crisis as "slight," and just 3% said there was no crisis.



3. Most respondents never tried to publish a replication study, but about 25% said they were able to publish a successful replication, and 13% had published a failed replication.

4. About two-thirds said that procedures to improve replication were in place in their respective laboratories, and about half of these respondents indicated that such procedures were new within the last five years.

5. More than 60% reported that pressure to publish and selective reporting always or often contribute to the replication problem. Other contributing factors endorsed by the majority include low statistical power, poor oversight of laboratory procedures, and the absence of routine replication in the original labs.

You probably already know that "publish or perish"—the pressure to publish constantly in order to demonstrate scientific merit—is a reality for new, tenure-track professors, especially in research-oriented universities. Accordingly, junior faculty are sometimes counseled to break their research down into pieces and publish those pieces in multiple articles, which would boost the publication count if successful. A related idea is that of the least publishable unit (LPU), or the smallest amount of data that could generate a journal article. The term is used in a derogatory or sarcastic way to describe the pursuit of the greatest quantity of publications at the expense of quality. Now, it is true that publication in prestigious journals will usually benefit a tenure-track professor, but the total quantity of published articles across journals at different levels of prestige is important, too. Emphasis on quantity may reduce study quality, which in turn makes it less likely that the results can be reproduced.

Maybe certain kinds of results in the "soft" sciences like psychology are inherently less reproducible than findings in the "hard" sciences like chemistry or physics. This is because human behavior may be less subject to general laws or principles that apply to every case and that work the same way over time. Such immutable laws are described as nomothetic factors. In contrast, idiographic factors are specific to individual cases. They concern discrete or unique facts or events that vary across both cases and time. If human behavior is more determined by idiographic factors (e.g., experiences, environments) than by nomothetic factors (e.g., genetics, physiology), then there is less potential for prediction and replication, too (Lykken, 1991). Along similar lines, Cesario (2014) suggested that priming effects and demand characteristics within each study could
be so specific to participant or task characteristics that they could be eliminated or even reversed over studies. That is, it may be unreasonable to expect reproducibility, especially without strong theory about individual differences in this area. Simons, Shoda, and Lindsay (2017) recommended that a statement about constraints on generality (COG) should be a part of the Discussion section in all empirical studies. The COG statement would explicitly define the target population and explain why the sample, measures, or procedures in the study are representative of those in the larger population. One aim is to limit unwarranted conclusions about external validity, and a second goal is to support replication through explicit definitions of the target population. Simons and colleagues offered examples of COG statements for three actual empirical studies.

Significance Testing Crisis

The geopolitical forecaster George Friedman (2009) wrote, "It is simply that the things that appear to be permanent and dominant at any given moment in history can change with stunning rapidity. Eras come and go" (p. 3). That is, just because a thing is seen everywhere at a particular point in time does not mean that it will never disappear. A related idea is that of "too big to fail," or the colloquial expression for something so dominant and interconnected that its downfall would be disastrous, and thus must be prevented. A risk of this viewpoint is the moral hazard that a protected entity assumes unwarranted risks, the costs of which are borne by others in case of failure. There are parallels between these concepts and significance testing, as explained next.

Since the 1960s—but not earlier (Hubbard & Ryan, 2000)—significance testing has been seen everywhere in the behavioral sciences: It is featured in almost all journal articles for empirical studies, and significant results designated with asterisks—the standard symbol for findings where p < α, the criterion level of statistical significance (e.g., "*" for p