The Oxford Handbook of Metamemory (ISBN 9780199336746, 0199336741; English, 593 pages, 2016)


Table of contents:
Cover
Series
The Oxford Handbook of Metamemory
Copyright
Short Contents
Oxford Library of Psychology
About the Editors
Contributors
Contents
Prologue: Some Metacomments on Metamemory
Part One Introduction to Metamemory
1. A Brief History of Metamemory Research and Handbook Overview
2. Methodology for Investigating Human Metamemory: Problems and Pitfalls
3. Internal Mapping and Its Impact on Measures of Absolute and Relative Metacognitive Accuracy
Part Two Metamemory Monitoring: Classical Judgments
4. Judgments of Learning: Methods, Data, and Theory
5. Introspecting on the Elusive: The Uncanny State of the Feeling of Knowing
6. Tip-of-the-Tongue States, Déjà Vu Experiences, and Other Odd Metacognitive Experiences
7. Sources of Bias in Judgment and Decision Making
8. The Self-Consistency Theory of Subjective Confidence
9. Metacognitive Aspects of Source Monitoring
Part Three Metamemory Monitoring: Special Issues
10. Monitoring and Regulation of Accuracy in Eyewitness Memory: Time to Get Some Control
11. Metamemory and Education
12. Prospective Memory: A Framework for Research on Metaintentions
13. Metamemory and Affect
14. Metamemory in Comparative Context
15. Looking Back and Forward on Hindsight Bias
Part Four Control of Memory
16. The Metacognitive Foundations of Effective Remembering
17. Self-Regulated Learning: An Overview of Theory and Data
18. The Need for Metaforgetting: Insights from Directed Forgetting
19. Metacognitive Quality-Control Processes in Memory Retrieval and Reporting
20. Three Pillars of False Memory Prevention: Orientation, Evaluation, and Corroboration
Part Five Neurocognition of Metamemory
21. The Ghost in the Machine: Self-Reflective Consciousness and the Neuroscience of Metacognition
22. The Cognitive Neuroscience of Source Monitoring
23. Anosognosia and Metacognition in Alzheimer’s Disease: Insights from Experimental Psychology
24. Metamemory in Psychopathology
Part Six Development of Metamemory
25. The Development of Metacognitive Knowledge in Children and Adolescents
26. Monitoring Memory in Old Age: Impaired, Spared, and Aware
27. Aging and Metacognitive Control
Index


The Oxford Handbook of Metamemory

OXFORD LIBRARY OF PSYCHOLOGY

Editor-in-Chief: Peter E. Nathan

Area Editors:
Clinical Psychology: David H. Barlow
Cognitive Neuroscience: Kevin N. Ochsner and Stephen M. Kosslyn
Cognitive Psychology: Daniel Reisberg
Counseling Psychology: Elizabeth M. Altmaier and Jo-Ida C. Hansen
Developmental Psychology: Philip David Zelazo
Health Psychology: Howard S. Friedman
History of Psychology: David B. Baker
Methods and Measurement: Todd D. Little
Neuropsychology: Kenneth M. Adams
Organizational Psychology: Steve W. J. Kozlowski
Personality and Social Psychology: Kay Deaux and Mark Snyder

The Oxford Handbook of Metamemory

Edited by John Dunlosky and Sarah (Uma) K. Tauber

Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide.

Oxford  New York
Auckland  Cape Town  Dar es Salaam  Hong Kong  Karachi  Kuala Lumpur  Madrid  Melbourne  Mexico City  Nairobi  New Delhi  Shanghai  Taipei  Toronto

With offices in
Argentina  Austria  Brazil  Chile  Czech Republic  France  Greece  Guatemala  Hungary  Italy  Japan  Poland  Portugal  Singapore  South Korea  Switzerland  Thailand  Turkey  Ukraine  Vietnam

Oxford is a registered trademark of Oxford University Press in the UK and certain other countries.

Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016

© Oxford University Press 2016 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by license, or under terms agreed with the appropriate reproduction rights organization. Inquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above. You must not circulate this work in any other form and you must impose this same condition on any acquirer. Library of Congress Cataloging-in-Publication Data The Oxford handbook of metamemory / edited by John Dunlosky and Sarah (Uma) K. Tauber.   pages cm. — (Oxford library of psychology) Includes index. ISBN 978–0–19–933674–6 1.  Metacognition.  2.  Memory.  I.  Dunlosky, John.  II.  Tauber, Sarah (Uma) K. BF311.O947 2016 153.1′2—dc23 2015023488

9 8 7 6 5 4 3 2 1 Printed in the United States of America on acid-free paper

SHORT CONTENTS

Oxford Library of Psychology  vii
About the Editors  ix
Contributors  xi
Table of Contents  xiii
Chapters  1–558
Index  559


OXFORD LIBRARY OF PSYCHOLOGY

The Oxford Library of Psychology, a landmark series of handbooks, is published by Oxford University Press, one of the world’s oldest and most highly respected publishers, with a tradition of publishing significant books in psychology. The ambitious goal of the Oxford Library of Psychology is nothing less than to span a vibrant, wide-ranging field and, in so doing, to fill a clear market need.

Encompassing a comprehensive set of handbooks, organized hierarchically, the Library incorporates volumes at different levels, each designed to meet a distinct need. At one level are a set of handbooks designed broadly to survey the major subfields of psychology; at another are numerous handbooks that cover important current focal research and scholarly areas of psychology in depth and detail. Planned as a reflection of the dynamism of psychology, the Library will grow and expand as psychology itself develops, thereby highlighting significant new research that will impact on the field. Adding to its accessibility and ease of use, the Library will be published in print and, later on, electronically.

The Library surveys psychology’s principal subfields with a set of handbooks that capture the current status and future prospects of those major subdisciplines. The initial set includes handbooks of social and personality psychology, clinical psychology, counseling psychology, school psychology, educational psychology, industrial and organizational psychology, cognitive psychology, cognitive neuroscience, methods and measurements, history, neuropsychology, personality assessment, developmental psychology, and more. Each handbook undertakes to review one of psychology’s major subdisciplines with breadth, comprehensiveness, and exemplary scholarship. In addition to these broadly conceived volumes, the Library also includes a large number of handbooks designed to explore in depth more specialized areas of scholarship and research, such as stress, health and coping, anxiety and related disorders, cognitive development, or child and adolescent assessment. In contrast to the broad coverage of the subfield handbooks, each of these latter volumes focuses on an especially productive, more highly focused line of scholarship and research. Whether at the broadest or most specific level, however, all of the Library handbooks offer synthetic coverage that reviews and evaluates the relevant past and present research and anticipates research in the future. Each handbook in the Library includes introductory and concluding chapters written by its editor to provide a roadmap to the handbook’s table of contents and to offer informed anticipations of significant future developments in that field.

An undertaking of this scope calls for handbook editors and chapter authors who are established scholars in the areas about which they write. Many of the nation’s and world’s most productive and best-respected psychologists have agreed to edit Library handbooks or write authoritative chapters in their areas of expertise.

For whom has the Oxford Library of Psychology been written? Because of its breadth, depth, and accessibility, the Library serves a diverse audience, including graduate students in psychology and their faculty mentors, scholars, researchers, and practitioners in psychology and related fields. Each will find in the Library the information they seek on the subfield or focal area of psychology in which they work or are interested. Befitting its commitment to accessibility, each handbook includes a comprehensive index, as well as extensive references to help guide research. And because the Library was designed from its inception as an online as well as print resource, its structure and contents will be readily and rationally searchable online. Further, once the Library is released online, the handbooks will be regularly and thoroughly updated.

In summary, the Oxford Library of Psychology will grow organically to provide a thoroughly informed perspective on the field of psychology, one that reflects both psychology’s dynamism and its increasing interdisciplinarity. Once published electronically, the Library is also destined to become a uniquely valuable interactive tool, with extended search and browsing capabilities. As you begin to consult this handbook, we sincerely hope you will share our enthusiasm for the more than 500-year tradition of Oxford University Press for excellence, innovation, and quality, as exemplified by the Oxford Library of Psychology.

Peter E. Nathan
Editor-in-Chief
Oxford Library of Psychology


ABOUT THE EDITORS

John Dunlosky
John Dunlosky is a professor of psychology at Kent State University. He has contributed empirical and theoretical work on memory and metacognition, including theories of self-regulated learning and metacomprehension. A major aim of his research program is to develop techniques to improve the effectiveness of people’s self-regulated learning across the lifespan. A fellow of the Association for Psychological Science, he is a founder of the International Association for Metacognition. He co-authored Metacognition, which is the first textbook on the topic, and has edited several books on metacognition and education.

Sarah (Uma) K. Tauber
Sarah “Uma” Tauber earned her PhD from Colorado State University, received post-doctoral training at Kent State University, and is currently an assistant professor at Texas Christian University. Her research focuses on how people monitor and control their ongoing learning, and how monitoring and control processes are influenced by aging in adulthood. She is a member of the Psychonomic Society and the International Association for Metacognition.


CONTRIBUTORS

Rakefet Ackerman, Israel Institute of Technology
Shiri Adiv, University of Haifa
Andre Aßfalg, Kwantlen Polytechnic University
Elisabeth Bacon, Institut National de la Santé et de la Recherche Médicale
Ute J. Bayen, Heinrich-Heine-Universität Düsseldorf
Aaron S. Benjamin, University of Illinois at Urbana-Champaign
Michael J. Beran, Georgia State University
Daniel M. Bernstein, Kwantlen Polytechnic University
Elizabeth Ligon Bjork, University of California, Los Angeles
Robert Bjork, University of California, Los Angeles
Daniel Buttaccio, University of Maryland
Alan D. Castel, University of California, Los Angeles
Jeffrey S. Chrabaszcz, University of Maryland
Anne M. Cleary, Colorado State University
Michael R. Dougherty, University of Maryland
John Dunlosky, Kent State University, Kent, OH
Anastasia Efklides, Aristotle University of Thessaloniki
Alexandra Ernst, Université de Bourgogne, France
Joshua L. Fiechter, University of Illinois at Urbana-Champaign
Bridgid Finn, Educational Testing Service
Nathaniel L. Foster, St. Mary’s College of Maryland
David A. Gallo, University of Chicago
Morris Goldsmith, University of Haifa
Maciej Hanczakowski, Cardiff University
Christopher Hertzog, Georgia Institute of Technology
Philip A. Higham, University of Southampton
Timothy J. Hollins, University of Plymouth
Gregory Hughes, Tufts University
Marie Izaute, Université Blaise Pascal
Asher Koriat, University of Haifa
Nate Kornell, Williams College
Beatrice G. Kuhlmann, Heinrich-Heine-Universität Düsseldorf
Ragav Kumar, Kwantlen Polytechnic University
James M. Lampinen, University of Arkansas
Meeyeon Lee, Tufts University
Elisabeth Löffler, University of Würzburg
Shannon McGillivray, Weber State University
Janet Metcalfe, Columbia University
Catherine D. Middlebrooks, University of California, Los Angeles
Karen J. Mitchell, Yale University
Daniel C. Mograbi, King’s College London and PUC-Rio
Robin Morris, King’s College London
Chris J. A. Moulin, Université de Bourgogne
Michael L. Mueller, Kent State University, Kent, OH
Matthew G. Rhodes, Colorado State University
Lili Sahakyan, University of Illinois at Urbana-Champaign
Wolfgang Schneider, University of Würzburg
Bennett L. Schwartz, Florida International University
J. David Smith, University at Buffalo, The State University of New York
Rebekah E. Smith, University of Texas at San Antonio
Nicholas C. Soderstrom, University of California, Los Angeles
Celine Souchay, Université de Bourgogne
Sarah (Uma) K. Tauber, Texas Christian University
Keith W. Thiede, Boise State University
Ayanna K. Thomas, Tufts University
Rick P. Thomas, Georgia Institute of Technology
Joe W. Tidwell, University of Maryland
Nash Unsworth, University of Oregon
David A. Washburn, Georgia State University
Nathan Weber, Flinders University
Carole L. Yue, University of California, Los Angeles
Katarzyna Zawadzka, University of Southampton

CONTENTS

Prologue: Some Metacomments on Metamemory  1
Robert Bjork

Part One  •  Introduction to Metamemory  5

1. A Brief History of Metamemory Research and Handbook Overview  7
Sarah (Uma) K. Tauber and John Dunlosky
2. Methodology for Investigating Human Metamemory: Problems and Pitfalls  23
John Dunlosky, Michael L. Mueller, and Keith W. Thiede
3. Internal Mapping and Its Impact on Measures of Absolute and Relative Metacognitive Accuracy  39
Philip A. Higham, Katarzyna Zawadzka, and Maciej Hanczakowski

Part Two  •  Metamemory Monitoring: Classical Judgments  63

4. Judgments of Learning: Methods, Data, and Theory  65
Matthew G. Rhodes
5. Introspecting on the Elusive: The Uncanny State of the Feeling of Knowing  81
Ayanna K. Thomas, Meeyeon Lee, and Gregory Hughes
6. Tip-of-the-Tongue States, Déjà Vu Experiences, and Other Odd Metacognitive Experiences  95
Bennett L. Schwartz and Anne M. Cleary
7. Sources of Bias in Judgment and Decision Making  109
Joe W. Tidwell, Daniel Buttaccio, Jeffrey S. Chrabaszcz, Michael R. Dougherty, and Rick P. Thomas
8. The Self-Consistency Theory of Subjective Confidence  127
Asher Koriat and Shiri Adiv
9. Metacognitive Aspects of Source Monitoring  149
Beatrice G. Kuhlmann and Ute J. Bayen


Part Three  •  Metamemory Monitoring: Special Issues  169

10. Monitoring and Regulation of Accuracy in Eyewitness Memory: Time to Get Some Control  171
Timothy J. Hollins and Nathan Weber
11. Metamemory and Education  197
Nicholas C. Soderstrom, Carole L. Yue, and Elizabeth Ligon Bjork
12. Prospective Memory: A Framework for Research on Metaintentions  217
Rebekah E. Smith
13. Metamemory and Affect  245
Anastasia Efklides
14. Metamemory in Comparative Context  269
David A. Washburn, Michael J. Beran, and J. David Smith
15. Looking Back and Forward on Hindsight Bias  289
Daniel M. Bernstein, Andre Aßfalg, Ragav Kumar, and Rakefet Ackerman

Part Four  •  Control of Memory  305

16. The Metacognitive Foundations of Effective Remembering  307
Joshua L. Fiechter, Aaron S. Benjamin, and Nash Unsworth
17. Self-Regulated Learning: An Overview of Theory and Data  325
Nate Kornell and Bridgid Finn
18. The Need for Metaforgetting: Insights from Directed Forgetting  341
Lili Sahakyan and Nathaniel L. Foster
19. Metacognitive Quality-Control Processes in Memory Retrieval and Reporting  357
Morris Goldsmith
20. Three Pillars of False Memory Prevention: Orientation, Evaluation, and Corroboration  387
David A. Gallo and James M. Lampinen

Part Five  •  Neurocognition of Metamemory  405

21. The Ghost in the Machine: Self-Reflective Consciousness and the Neuroscience of Metacognition  407
Janet Metcalfe and Bennett L. Schwartz
22. The Cognitive Neuroscience of Source Monitoring  425
Karen J. Mitchell
23. Anosognosia and Metacognition in Alzheimer’s Disease: Insights from Experimental Psychology  451
Alexandra Ernst, Chris J. A. Moulin, Celine Souchay, Daniel C. Mograbi, and Robin Morris
24. Metamemory in Psychopathology  473
Marie Izaute and Elisabeth Bacon


Part Six  •  Development of Metamemory  489

25. The Development of Metacognitive Knowledge in Children and Adolescents  491
Wolfgang Schneider and Elisabeth Löffler
26. Monitoring Memory in Old Age: Impaired, Spared, and Aware  519
Alan D. Castel, Catherine D. Middlebrooks, and Shannon McGillivray
27. Aging and Metacognitive Control  537
Christopher Hertzog

Index  559


Prologue: Some Metacomments on Metamemory
Robert Bjork

Abstract

In this prologue, I comment on key events in the history of research on metamemory and on my own reactions to those events—beginning with the now-famous research on feeling-of-knowing judgments carried out by Joe Hart 50 years ago when Joe and I were both graduate students at Stanford University. After speculating on why mainstream memory researchers, me in particular, were slow to realize the importance of research on metacognitive processes, even after John Flavell and Henry Wellman had provided an elegant definition of the field during the 1970s, I discuss the events and dynamics that ultimately made it clear that understanding metacognitive processes is a critical component of understanding human learning and memory processes more broadly.

Key Words: metamemory, applied metamemory, memory, history, learning

This handbook provides a uniquely good and complete picture of the rich, varied, and evolving world of research on metamemory—and on metacognitive processes more broadly. Looking back, it puzzles me that I was so slow to appreciate the importance and potential of such research, especially given that I was around at the beginning, so to speak. That beginning, as identified by the editors in their excellent summary of the history of research on metamemory, goes back to 1965, fifty years ago, when Joseph Hart, then a graduate student at Stanford University, carried out and reported experiments on feeling-of-knowing effects. Hart found, basically, that participants, when unable to answer a general knowledge question of some type, were very accurate in predicting their ability to pick the correct answer from among four alternatives, a finding that raised some fundamental questions about how we monitor and control the workings of our memories.

When I say that I was around at the beginning, I mean more than that I was alive then. In fact, I was also a graduate student at Stanford in 1965 and well aware of Joe Hart’s research, but I simply thought of his findings as a kind of curiosity (that is, if people do not know the answer to a question, how can they judge whether they know it?). Across the next twenty-plus years, I was also slow to appreciate the importance of the theoretical and empirical contributions of John Flavell and Henry Wellman, who—from a developmental psychology perspective—did nothing less than define metamemory as an identifiable and important domain of research, and I was slow as well to appreciate Ann Brown’s arguments with respect to the importance of metacognitive processes in educational contexts. I remember thinking, too, that early definitions of metacognition and metamemory, such as “thinking about thinking” or “knowing about knowing,” referred to philosophical matters that might make for an interesting discussion at a cocktail party but not rigorous or productive research—so that, too, may have been a factor in my not seeing the importance of understanding metacognitive processes.

Among mainstream cognitive psychologists, I may well have been uniquely slow to realize the importance of understanding metacognitive processes, but—in looking back—I think I was far from alone. A search of Google Scholar reveals, for example, that it was only around the mid-1980s—when Tom Nelson and a few others began to show the full potential of metacognitive measures, such as judgments of learning, feeling-of-knowing judgments, ease-of-learning judgments, and so forth—that research on metamemory and metacognition began to appear with reasonable frequency in the mainstream cognitive psychology journals. The number of articles in which either of the terms “judgment of learning” or “feeling of knowing” appeared roughly tripled every decade: going from only 104 during 1965–1974 to 298 during 1975–1984 to 830 during 1985–1994 to 2,090 during 1995–2004 to 6,100 during 2005–2014. Similarly, the number of articles in which “metamemory” or “metacognition” appeared in the title increased from three during 1965–1974 (one unpublished paper, one survey instrument, and one study of metacognition in rhesus monkeys) to 6,530 during 2005–2014.
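As a quick check of the growth claim above, the following back-of-the-envelope Python sketch (using only the counts just reported) confirms that each decade’s count is roughly triple the previous one:

```python
# Decade-by-decade Google Scholar counts reported above for articles
# containing "judgment of learning" or "feeling of knowing".
counts = {
    "1965-1974": 104,
    "1975-1984": 298,
    "1985-1994": 830,
    "1995-2004": 2090,
    "2005-2014": 6100,
}

values = list(counts.values())
for previous, current in zip(values, values[1:]):
    # Each ratio comes out between roughly 2.5 and 2.9, i.e., "roughly tripled."
    print(f"{current} / {previous} = {current / previous:.2f}")
```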

It took a while, too, for research on metacognitive processes to be recognized as its own field of study, rather than as a kind of adjunct to other fields. During the late 1980s and early 1990s, for example, papers on metamemory and metacognition at the meetings of the Psychonomic Society were sprinkled throughout the program in sessions on other topics. Tom Nelson and I urged the Society’s program committee to schedule separate sessions on metamemory and metacognition, but it was not until 1996 that the first such paper session was made part of the Society’s program (the session was actually called “Metacognition and History” because it included, along with four papers on metacognitive processes, a paper on the history of the Psychonomic Society). During that period, there was even, for a few years, a kind of support-group dinner of metamemory researchers during the Society’s annual meeting. But, oh, how things have changed: So many of the papers and posters at the 2014 meeting of the Psychonomic Society included metamemory/metacognition measures that I gave up trying to count them. Beyond the talks or posters that were presented in the four sessions that were designated explicitly as sessions on metamemory or metacognition, a large number of papers and posters in other sessions, especially the multiple sessions on Human Learning and Instruction and the multiple sessions on Test Effects, included metacognitive measures as a key component of the research being reported.

So what changed and led research on metacognitive processes to catch fire among cognitive psychologists, so to speak? One clear influence was that the 1970s brought the seminal research by Kahneman and Tversky, by Fischhoff, and by others on the role of metacognitive heuristics and biases in judgment and decision making. Another important factor was that the theoretical and empirical papers by Flavell and Wellman had a gradually increasing impact on cognitive psychologists, as well as on developmental psychologists. Flavell’s 1979 article in the American Psychologist, “Metacognition and Cognitive Monitoring: A New Area of Cognitive-Developmental Inquiry,” has been cited 6,109 times, for example. By contrast, Hart’s 1965 paper, “Memory and the Feeling of Knowing Experience,” has been cited 626 times. The appearance in 1990 of Nelson and Narens’s framework for research on metamemory (reproduced as Figure 1.4 in Chapter 1) was extremely important for two reasons: It provided a kind of schematic organization for the disparate methods and measures of the early research on metamemory processes, and it clarified the critical distinction between monitoring processes and control processes. As the editors point out, the appearance of Nelson’s 1992 collection of “core readings” on metacognition was also important (despite the fact that Nelson, when asked by the publisher to shorten the book, decided that my 1988 article, “Retrieval Practice and the Maintenance of Knowledge,” could be sacrificed).

In my own particular case, the biggest single “aha moment” came during the late 1980s, at which point evidence was emerging that learners, during the acquisition of skills and knowledge, were susceptible to interpreting their current performance (and their related subjective experiences, such as a sense of fluency or familiarity) as valid indices of learning, whereas such indices are unreliable measures of learning at best, and sometimes are entirely misleading. That is, there was evidence then, and far more since then, that conditions of study or practice that make performance improve rapidly can fail to support long-term retention and transfer, whereas certain conditions that create difficulties and challenges for the learners (which I came to call “desirable difficulties”; Bjork, 1994) can enhance long-term retention and transfer. That pattern of memory and metamemory findings seemed critically important to me because it suggested not only that students and teachers alike were susceptible to choosing (and preferring) poorer conditions of learning over better conditions, but also, more broadly, that we, as learners, tend to have a faulty mental model of how we learn and remember or fail to learn and remember.

After that realization, it seemed incumbent on me to try to make all of the several hundred students in my cognitive psychology course better learners. Part of doing so, I decided, was to demystify the terms “metamemory” and “metacognition,” terms that can elicit glassy-eyed stares from students. As part of trying to make those terms and the importance of metacognitive processes more concrete, I developed a list of metacognitive decisions and judgments my students were already making but did not think of as metacognitive processes, such as deciding how, when, and what to study; making pre-exam judgments of preparation and post-exam judgments of performance; interpreting fluency of reading or retrieval as indices of comprehension; and on and on. I remain unsure of how much impact my list and associated concrete examples had on my students, if any, but I completely convinced myself as to why it was so important to understand metacognitive processes, and the argument seems all the more compelling now. In our increasingly complex and rapidly changing world, knowing how to learn has never been more important, especially as both the need and opportunities for us to learn on our own outside of formal classroom settings continue to grow—not simply during the years of formal schooling but across our lifetimes.

The way that controlled research on metacognitive processes has blossomed, branched out, and spread, as documented in this handbook, is quite amazing. Were this handbook to be used as a textbook in a college seminar, that seminar would be very different and far broader than such a seminar only ten years ago—and it might require more than a single academic term. This handbook points in interesting ways to the future as well. There seems little doubt, for example, that research of the type reported in Part V, “Neurocognition of Metamemory,” will explode during the next decade. Identifying the patterns of brain activity that correspond to different levels of metacognitive judgments, such as feeling of knowing or judgment of learning, and how those patterns match and mismatch the patterns associated with different levels of actual recall or recognition performance, seems especially important, and the possible clinical applications of neurocognitive research in metacognition seem important as well. More broadly, this handbook identifies issues in metamemory research that are critical and applications of metamemory findings that are potentially important, and it does so with respect to multiple contexts, such as the classroom, the courtroom, and the self-management of our own learning, as well as with respect to phases of our lives, from the development of metacognitive processes early in life to how those processes change with subsequent aging. It is difficult to say just how research on metacognitive processes will ebb and flow and spread during the next 10 years, but it should be exciting.

References

Bjork, R. A. (1988). Retrieval practice and the maintenance of knowledge. In M. M. Gruneberg, P. E. Morris, & R. N. Sykes (Eds.), Practical aspects of memory II (pp. 396–401). London, England: Wiley.
Bjork, R. A. (1994). Memory and metamemory considerations in the training of human beings. In J. Metcalfe and A. Shimamura (Eds.), Metacognition: Knowing about knowing (pp. 185–205). Cambridge, MA: MIT Press.
Flavell, J. H. (1979). Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry. American Psychologist, 34, 906–911. doi: 10.1037/0003-066X.34.10.906
Hart, J. T. (1965). Memory and the feeling-of-knowing experience. Journal of Educational Psychology, 56, 208–216. doi: 10.1037/h0022263


PART 1
Introduction to Metamemory

CHAPTER 1
A Brief History of Metamemory Research and Handbook Overview

Sarah (Uma) K. Tauber and John Dunlosky

Abstract

Metamemory has a rich history: Its empirical and theoretical roots can be traced back to at least 1965, although metamemory techniques have been developed and discussed since Aristotle. In this chapter, we describe the origins of metamemory research by showcasing some founders of the field and their methodological and theoretical contributions. Joseph Hart conducted what is considered the first objective metamemory research, John Flavell coined the term metamemory in 1971 and provided theoretical fodder for the field, and Ann Brown brought early attention to metamemory by emphasizing its relevance to education. In 1990, Nelson and Narens introduced a framework that unified the field, which remains influential today. The chapter follows the early progression of metamemory research and foreshadows contemporary approaches to metamemory. It ends with a user’s guide to this handbook, including an overview of each section, an introduction to individual chapters, and recommendations for how to approach the Handbook.

Key Words: metamemory, history, metacognition, monitoring, control

“No one has ever written the history of any period of thought or of life without being greatly puzzled about the point at which to begin it. For whatever event be chosen as the first of the chronicle, this hypothetically first event is conditioned by other events. Every history, therefore, begins at a more or less arbitrary point.” —Calkins, 1910, p. 17

Mary Whiton Calkins’ (1910) wisdom is especially true for choosing the first event of the metamemory chronicle: Do we begin with the ancient Greeks who pondered about the nature of memory and developed the first mnemonics to control it (both of which are acts of metamemory), or with the introspectionist movement in the early 1900s, where self-observation was a central tool of enquiry, or should we pinpoint the birth of metamemory with more contemporary events that include coining the term or developing the first experimental methods to investigate it? In a prior chapter on the history of metamemory, Dunlosky and Metcalfe (2009) began with the ancient Greeks. And why not? Aristotle’s theory of memory is profound and includes elements of modern theory of distinctive processing (Robinson, 1989), and Simonides developed one of the most powerful metamemory mnemonics to date (Yates, 1997). Nevertheless, it is debatable whether these and other philosophers were analyzing metacognition (thinking about thinking) per se, and it is certainly the case that they were not conducting experimental research about metamemory. Thus, we decided to begin our brief history with the beginning of experimental research on metamemory.

A Brief History of Metamemory Research

Even here it is difficult to know where to begin, but a defensible start is with Joseph Hart. Hart was the first to emphasize the importance of objectively evaluating the accuracy of metamemory states. In 1965, he published the first empirical paper that estimated the accuracy of feeling-of-knowing (FOK) judgments (for more on FOKs, see Thomas, Lee, & Hughes, this volume). To do so, he created a new task that has been dubbed the recall-judgment-recognition (RJR) paradigm, which capitalized on the well-established effect that people tend to recognize more information than they can recall. Hart’s participants completed an initial recall test of basic facts (e.g., Who wrote From Here to Eternity?), and for the items they could not answer they provided a dichotomous FOK (i.e., yes or no). FOK judgments indicated whether participants knew the answer to a question, even though they could not currently recall it. In contrast with prior work, Hart established the accuracy of these introspective judgments by administering an objective criterion test of memory. Participants were given a four-alternative recognition test for items that they had answered incorrectly on the recall test. To objectively assess FOK accuracy, Hart computed recognition memory scores to evaluate whether they were higher when people predicted they would recognize the answer (FOK of “yes”) than when they said they wouldn’t (FOK of “no”). As evident from Table 1.1, people’s FOKs did demonstrate above-chance accuracy, which was replicated in a follow-up experiment (that employed graded FOK judgments), among many others (Hart, 1966, 1967a, 1967b). Thus, Hart established that metamemory states can be accurate: “When people feel that they know something, it is very likely that they do know it, and when they feel that they do not, it is likely that they do not” (Hart, 1967b, p. 196).

One reason that FOKs have attracted so much interest is the simple fact that above-chance FOK is mysterious: People presumably can feel information that is (or is not) stored in memory, even though they currently have no access to that information! Contemporary solutions to this mystery—pertaining to FOKs and other metamemory judgments—are presented by Thomas, Lee, and Hughes (this volume). But, in the late 1960s, Hart (1967a) argued that monitoring is based on direct access to the memory traces for target items. That is, everyone presumably has privileged access to how strongly information is stored in his or her own memory. For instance, when presented with the question “Who wrote From Here to Eternity?,” a person determines whether an answer (i.e., James Jones) is available in memory and uses the strength of the memory representation for the answer as a basis for the judgment. If an answer was unavailable, or if an answer was available but its strength fell below a threshold, then the response for the FOK judgment will be “no.” Alternatively, if an answer was available and had a strong enough trace, then the FOK will be “yes.”

This direct-access account of monitoring accuracy predicts that people’s accuracy should be consistently high, mainly because the judgments are directly based on the underlying strength of how well answers are stored in memory, which also should drive later recognition performance. This account is historically important because it provided the first testable predictions about FOK accuracy, even though it has been disconfirmed numerous times since Hart published his influential papers. The direct-access account was most eloquently rebutted by Koriat (1993, 1997), who among others demonstrated dissociations between metamemory judgments and test performance (for further discussion, see Koriat & Adiv, this volume). Many of the chapters in this volume have their foundations in Hart’s work, and his insight to develop the methods to measure metamemory accuracy has had an enormous impact.

Table 1.1. Proportion correctly recognized on the final recognition test in the recall-judgment-recognition (RJR) paradigm from Hart’s landmark papers.

Study                                               “Yes”/High-Value FOKs   “No”/Low-Value FOKs
Hart (1965), Experiment 1                                    .76                    .43
Hart (1965), Experiment 2                                    .66                    .38
Hart (1966), Experiment 1                                    .70                    .39
Hart (1966), Experiment 2                                    .52                    .39
Hart (1967a), Experiment 1, One Exposure Group               .47                    .35
Hart (1967a), Experiment 1, Two Exposures Group              .48                    .38
Hart (1967a), Experiment 1, Three Exposures Group            .66                    .48
Hart (1967a), Experiment 2                                   .63                    .47
Hart (1967b), First Recall Test                              .62                    .38
Hart (1967b), Second Recall Test                             .59                    .30

Note. FOK = feeling-of-knowing judgment. Each column gives the proportion correct on the final recognition test for items that received “yes” (or high-value) FOKs versus “no” (or low-value) FOKs. The design in Hart (1967a, Experiment 1) included three groups of participants: a group who studied the items once, a group who studied the items twice, and a group who studied the items three times. The design in Hart (1967b) adopted a first recall test, a second recall test, FOK judgments, and a final recognition test. Thus, each Hart (1967b) row indicates performance on the final recognition test relative to FOK judgments and either the first recall test or the second recall test.
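To make the logic of Hart’s accuracy check concrete, here is a minimal sketch in Python. It is our illustration rather than Hart’s analysis, and the item data are invented; it simply conditions recognition performance on the dichotomous FOK, as the RJR paradigm prescribes.

```python
# Each tuple pairs an unrecalled item's dichotomous FOK ("yes"/"no") with
# whether the item was later answered correctly on the four-alternative
# recognition test. These data are invented for illustration.
items = [
    ("yes", True), ("yes", True), ("yes", False), ("yes", True),
    ("no", False), ("no", True), ("no", False), ("no", False),
]

def recognition_rate(data, fok):
    """Proportion recognized correctly among items given this FOK."""
    outcomes = [correct for judgment, correct in data if judgment == fok]
    return sum(outcomes) / len(outcomes)

p_yes = recognition_rate(items, "yes")
p_no = recognition_rate(items, "no")
chance = 1 / 4  # guessing rate on a four-alternative recognition test

# FOKs are accurate in Hart's sense when "yes" items are recognized more
# often than "no" items (e.g., .76 vs. .43 in Hart, 1965, Experiment 1).
print(f"P(correct | FOK = yes) = {p_yes:.2f}")
print(f"P(correct | FOK = no)  = {p_no:.2f}")
print(f"Chance level           = {chance:.2f}")
```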

It was John Flavell, however, who gave the field of metacognition its name and a voice of its own. In 1971, Flavell coined the term metamemory: “What, then, is memory development the development of? It seems in large part to be the development of intelligent structuring and storage of input, of intelligent search and retrieval operations, and of intelligent monitoring and knowledge of these storage and retrieval operations—a kind of ‘metamemory’, perhaps. Such is the nature of memory development. Let’s all go out and study it!” (Flavell, 1971, p. 227).

In his landmark paper, Flavell (1979) also introduced a model of cognitive monitoring, which included four components: (1) metacognitive knowledge, (2) metacognitive experiences, (3) goals or tasks, and (4) actions or strategies. Metacognitive knowledge is beliefs or theories that one may have about one’s cognition—for example, the belief that a friend would perform better on an upcoming exam than you will because you think your friend is a better test-taker. Metacognitive knowledge can also include beliefs about your ability as a learner, beliefs about a given learning task, or beliefs about strategies to use to succeed at a task. Metacognitive experiences include cues that arise while completing a task that may be relevant for understanding your own cognition—for example, feeling that you understand new material for a class because, as you study, the ideas seem to make sense without much effort. Goals (or tasks) refer to the desired outcome of a given cognitive effort, and actions (or strategies) refer to the methods one can use in order to achieve a goal.

Flavell’s (1979) model provided a theoretical foundation for describing, understanding, and investigating metacognition in general and metamemory more specifically. He provided big questions for the field, and he emphasized the importance of answering them: “For example, how much good does cognitive monitoring actually do us in various types of cognitive enterprises? Also, might it not even do more harm than good, especially if used in excess or nonselectively? … Lack of hard evidence notwithstanding, however, I am absolutely convinced that there is, overall, far too little rather than enough or too much cognitive monitoring in this world. This is true for adults as well as for children, but it is especially true for children. For example, I find it hard to believe that children who do more cognitive monitoring would not learn better both in and out of school than children who do less.” (Flavell, 1979, p. 910)

Not only does Flavell’s early agenda continue to drive research today, he also conducted some of the first research about children’s metamemory. As one example, Flavell and his colleagues systematically investigated children’s beliefs about their memory. To do so, they conducted detailed interviews with children who were in kindergarten through fifth grade (Kreutzer, Leonard, & Flavell, 1975). The children were asked a variety of questions about their memory (e.g., Do you forget? Do you remember well—are you a good rememberer? Can you remember better than your friends?), as well as a host of questions about general memory principles, such as the effect of relearning on memory and the effect of relatedness on memory (see Table 1.2). As evident from Figure 1.1 (left panel), the majority of participants in all grades expected that Jim would find it easier to learn the bird names—suggesting that children as young as kindergarten-aged have beliefs about relearning. Moreover, given the well-established benefits of relearning on memory, children’s beliefs about relearning are accurate. However, such accuracy was not evident at all ages for other principles. For example, consider the influence of relatedness on memory. Related information (e.g., a pair of words such as boy-girl) is commonly easier to learn and better remembered relative to unrelated information (e.g., a pair of words such as Mary-walk).

Table 1.2. Example items from the metamemory interview conducted by Kreutzer, Leonard, and Flavell (1975).

Memory Principle: The Effect of Relearning on Memory
Question: “Jim and Bill are in grade __ [participant’s own grade]. The teacher wanted them to learn the names of all the kinds of birds they might find in their city. Jim had learned them last year and then forgot them. Bill had never learned them before. Do you think one of the boys would find it easier to learn the names of all the birds? Which one? Why?” (Kreutzer et al., 1975, p. 8)

Memory Principle: The Effect of Relatedness on Memory
Question: “These words are opposites: ‘boy’ goes with ‘girl,’ ‘easy’ goes with ‘hard.’ And these words are people and things they might do. So ‘Mary’ goes with ‘walk’ (etc.). Do you think one of these would be easier for you to learn? Why?” (Kreutzer et al., 1975, p. 14)

Number of Participants Who Reported Each Response

25

Jim (Relearner) Same

20

Bill (New Learner) Do Not Know

about the impact of relatedness on memory and that these beliefs are accurate, but that these beliefs have not formed by the first grade. Kreutzer et al.’s original metamemory interview and adaptations of it have been used widely to explore the development of metamemory (e.g., Borkowski, Peck, Reid, & Kurtz, 1983; Kurtz & Borkowski, 1987; Lockl & Schneider, 2007; Schneider, Borkowski, Kurtz, & Kerwin, 1986), which is explored in detail by Schneider and Löffler (this volume). As important, their pioneering research foreshadowed the frequent use of interviews and detailed questionnaires to explore other topics. Zelinski, Gilewski, and Thompson (1980) introduced the Metamemory Questionnaire (MQ) and portions of it were subsequently incorporated into the Memory Functioning Questionnaire (MFQ; Gilewski, Zelinski, & Schaie, 1990). The MFQ includes subscales to assess people’s perceptions of their own frequency of forgetting, severity of forgetting, memory function over time, and strategy use. As a second example, Dixon, Hultsch, and Hertzog (1988) created the Metamemory in Adulthood (MIA) questionnaire to assess older adults’ beliefs about their own cognition. The MIA questionnaire is designed to measure multiple aspects of adults’ metamemory (for reviews see Dixon, 1989; Hertzog & Hultsch, 2000). And as a final example, Pintrich, Smith, Garcia, and McKeachie (1993) created and evaluated the Motivated Strategies for Learning Questionnaire (MSLQ). This questionnaire includes scales of motivation (e.g., self-efficacy, test anxiety) and of learning strategies. For the latter, subscales include resource management, cognition, and metacognition. The metacognition subscale incorporates questions about self-regulation (e.g., “I often find that I have been reading for class but Related Pairs

Unrelated Pairs

Same

15 10 5 0

K

1

3 Grade Level

5

K

1

3

5

Grade Level

Figure 1.1  The number of participants who reported each response in Kreutzer, Leonard, and Flavell (1975). K = kindergarten, 1 = first grade, 3 = third grade, 5 = fifth grade. Left panel = responses to a question about the effect of relearning on memory. Right panel = responses to a question about the effect of relatedness on memory. Figures generated based on data from Kreutzer et al. (1975).

10

A Brief History of Metamemory Research and Handbook Overview

don’t know what it was all about.”). This is not an exhaustive list of metamemory questionnaires, but these suffice to illustrate that metamemory questionnaires and interviews have proven valuable for exploring people’s metamemory knowledge, monitoring, and use of control strategies. As a graduate student of Flavell, Henry Wellman was another early contributor to establish metamemory as a field. He advocated for increased research on metamemory and conducted some of the first research to investigate it (e.g., Flavell & Wellman, 1977; Wellman, 1977, 1978). At the same time, Ann Brown championed metacognitive research by emphasizing its importance in educational contexts. In 1975, she provided a comprehensive review of the new field in which she pointed out that the majority of research on the development of metamemory was conducted in the laboratory and in the context of tightly controlled memory tasks. She emphasized the need to create effective training programs designed to improve children’s metamemory and memory and to investigate such programs in the wild—in classrooms and with respect to students’ learning. Critically, her early contributions have inspired decades of metamemory research relevant to education. Some of these advances are highlighted in chapters in the present volume (e.g., Fiechter, Benjamin, & Unsworth; Soderstrom, Yue, & Bjork). In fact, Soderstrom et al. (this volume) provide an in-depth discussion of educational implications of metamemory research and recommend procedures to improve students’ metamemory and their achievement. Many others made important contributions to the growing field of metamemory—too many to acknowledge in our brief history, but fortunately, Nelson’s (1992) compilation of core readings on metacognition is an excellent source for many of the foundational papers on metamemory as well as those focused more broadly on metacognition. A quick peek at this book shows that in its infancy, the field began with high-quality research focused on many different aspects of metacognition. Using these core readings as inspiration, in Figure 1.2 we offer a timeline of foundational research on metamemory (for a historical timeline of research on study time decisions, see Son & Kornell, 2008). The timeline consists of a mix of ground-breaking papers that were the first published in an area or that have had a major impact on the field (some of which meet both criteria). We also wanted the timeline to reflect the organization of the Oxford Handbook of Metamemory, so we targeted papers

that are relevant for the areas of research represented in this handbook. With that said, some chapters in the handbook are not represented in the 1965–1995 timeline because the areas are either well-established with papers that are impressively old, or are relatively new with the landmark papers being rather recent. For example, two chapters in the present volume discuss metamemory in legal contexts:  Gallo and Lampinen (this volume) consider how metamemory processes can act as an avenue to prevent false memories, and Hollins and Weber (this volume) provide a detailed discussion of eyewitness confidence. In both cases, the initial research can be traced back to Munsterberg (1908), who was among the first to contemplate subjective experiences of eyewitness memory: “The public in the main suspects that witnesses lie, while taking for granted that if he is normal and conscious of responsibility he may forget a thing, but it would not believe that he could remember the wrong thing. The confidence in the reliability of memory is so general that the suspicion of memory illusions evidently plays a small role in the mind of the juryman …” (Munsterberg, 1908, p. 36)

Given that Munsterberg’s foundational book was published in 1908, metamemory in legal contexts is not represented in the timeline (which begins in 1965). On the more recent end of the timeline, two chapters in this volume introduce new areas of metamemory research. Sahakyan and Foster (this volume) provide an overview of metamemory processes in the context of directed forgetting, which they refer to as metaforgetting. Smith (this volume) outlines a new research program for metaintentions, which pertain to the metamemory of prospective memory. Other papers that appear in the timeline are foundational for more than one chapter in this volume. As one example, Wellman (1978) conducted influential research on metamemory from a developmental perspective, which is relevant for all three chapters focused on the developmental processes. The readings provided in the timeline will offer solid footing for any new (or experienced) researcher in the field. As previously mentioned, many of these papers can be found in Nelson’s (1992) edited volume and more details on these (and other) significant papers are also discussed in chapters in this handbook. We highly recommend reading these foundational papers, if you haven’t already. When you do, you may notice that there are few Tauber and Dunlosky

11

Feeling of Knowing (FOK) Hart (1965)

1965 Tip of the Tongue (TOT) Brown & McNeill (1966)

Self-Paced Study

Judgment of Learning (JOL)

Zacks (1969)

Arbuckle & Cuddy (1969)

Judgment & Decision Making

Hindsight Bias

Tversky & Kahneman (1974)

Development in Childhood Wellman (1978)

Fischhoff (1975)

Source Monitoring Johnson, Raye, Wang, & Taylor (1979)

Strategy Use

Feeling of Confidence

Education & Metamemory

Fischhoff, Slovic, & Lichtenstein (1977)

Markman (1977)

1980

Development in Adulthood

Measurement of Accuracy

Bruce, Coyne, & Botwinick (1982)

Nelson (1984)

Neuroscience of Metamemory Shimamura & Squire (1986)

Reder (1987) Metamemory Methods

Prospective Memory

Nelson & Narens (1990)

Einstein & McDaniel (1990) Nonhuman Animal Metacognition

Metamemory in Alzheimer’s disease McGlynn & Kaszniak (1991)

1995

Smith, Schull, Strote, McGee, Egnor, & Erb (1995) Figure 1.2  A historical timeline of foundational metamemory research relevant for the current volume.

connections drawn between the areas of research. This communication gap still persists, albeit to a lesser degree. Even so, in 1990, Nelson and Narens introduced a framework that has aided in unifying the field. The framework connected disparate areas of inquiry and by doing so created a larger community of metamemory researchers. Nelson and Narens (1990) built their framework upon a model of metamemory that capitalized on the interrelated nature of metamemory and memory (see Figure 1.3). In this model, memory is the object level and metamemory is the metalevel. The metalevel contains a model of the object level. Information is transmitted from the object level to the metalevel via monitoring, and information is transmitted back to the object level from the metalevel via control. To illustrate these two processes, 12

consider students who are studying in preparation for an upcoming exam. Monitoring refers to evaluating the current status of learning. For instance, students studying for an exam may evaluate whether they know relevant information or not while they are studying. Control refers to regulatory actions, which can modify cognition, such as the decision to stop studying or decisions about how to study. For instance, if students judge that some information has not been well learned, they may be more likely to restudy that information or use a different encoding strategy to learn it. In this example, monitoring is used to make control decisions, which highlights the reciprocal interplay between monitoring and control process as represented in Figure 1.3. Nelson and Narens (1990) used this model of metamemory as a basis for a cohesive framework,

A Brief History of Metamemory Research and Handbook Overview

META-LEVEL

Monitoring

Control

Flow of Information

OBJECT-LEVEL

(1)

(2)

Control

Monitoring

(3)

Monitoring

Monitoring Figure 1.3  Nelson and Narens (1990) original theoretical model of memory and metamemory. Reprinted from Nelson (1990, pp. 125–173), with permission from Elsevier.

which specified the measures that can be used to tap both monitoring and control. The measures included monitoring judgments and control decisions organized by the phases of learning: acquisition, retention, and retrieval. We reprinted the original framework in Figure 1.4, and an updated version of it is presented in Dunlosky, Mueller, and Thiede (this volume). Using this framework, researchers can easily fit individual research projects within the larger scope of metamemory research, and the framework has inspired some to expand their research programs to include joint analysis of both monitoring and control relationships. Because of this, their framework continues to have an influence on the field, and since 1990, a lot has happened in metamemory research, and the field continues to expand. In this handbook, you will find systematic reviews of classical and contemporary issues about metamemory. To preview, we have grouped chapters in sections that are roughly organized by the Nelson and Narens (1990) framework, with chapters being grouped as to whether they emphasize monitoring of learning or control of study. Each chapter introduces a different area of metamemory

research, some of which have been extensively investigated (e.g., judgments of learning, Rhodes, this volume) and some of which represent new frontiers for metamemory research (e.g., prospective memory and metaintentions, Smith, this volume). Each chapter includes discussion of theoretical perspectives, presentation of key findings that provide evidence relevant to those perspectives, and suggestions for future research. For the latter, we urged authors to reveal the most cutting-edge ideas that will foster innovative advances for the field, and we were thrilled how well they delivered—open any chapter, and you will find exciting ideas to pursue. Of course, we hope you have already read Bjork’s (this volume) prologue to this Handbook, which provides a perspective on the field from someone who has watched the field arise as well as made major contributions to its growth. Here is a preview of the sections, focusing on those chapters that have not already been mentioned.

Handbook Overview Introduction to Metamemory

The handbook begins with chapters that provide an overview about methods and measurement Tauber and Dunlosky

13

M O N I T O R I N G

JUDGMENTS OF KNOWING

CONFIDENCE IN RETRIEVED ANSWERS

CASE-OF-LEARNING JUDGMENTS

IN ADVANCE OF LEARNING

C O N T R O L

FEELING-OF-KNOWING JUDGMENTS

ON-GOING LEARNING

SELECTION OF KIND OF PROCESSING

ALLOCATION OF STUDY TIME

MAINTENANCE OF KNOWLEDGE

TERMINATION OF STUDY

SELF-DIRECTED SEARCH

SELECTION OF SEARCH STRATEGY

OUTPUT OF RESPONSE

TERMINATION OF SEARCH

Figure 1.4  Nelson and Narens (1990) original metamemory framework. Reprinted from Nelson (1990, pp. 125–173), with permission from Elsevier.

issues in metamemory. In this section, Dunlosky, Thiede, and Mueller (this volume) provide an in-depth look at the pitfalls that researchers face as they conduct metamemory research. Their chapter is a tribute and extension to one by Schwartz and Metcalfe (1994) in which they are the first to point out some of the unique challenges for conducting quality metamemory research. Higham, Zawadzka, and Hanczakowski (this volume) provide an insightful discussion of how to measure and interpret absolute accuracy, which is often examined in metamemory research but may easily lead to incorrect conclusions about people’s monitoring ability. Their insights have startling and important implications for research on the accuracy of metamemory judgments.

Metamemory Monitoring: Classical Judgments

The second section of the volume includes chapters that introduce the classical judgments used to measure and investigate people's memory monitoring. Each chapter introduces a judgment and provides a thorough discussion of research with that judgment type. The section begins during the study phase, with Rhodes' (this volume) chapter on people's judgments of learning (JOLs), which covers everything from the seminal work by Arbuckle and Cuddy (1969) to the most current evidence for the prevailing theoretical accounts of JOLs. In his "Final Thoughts," Rhodes offers several core themes to inspire and guide research on JOLs. Moving to the monitoring of retrieval, Thomas, Lee, and Hughes (this volume) provide an in-depth analysis of feeling-of-knowing (FOK) judgments and their accuracy. Their historical overview complements the current chapter, and their discussion of the development of FOK theory goes well beyond it. Their emphasis on the role of diagnostic cues in FOK accuracy sets a research agenda for the next generation of FOK studies. Schwartz and Cleary (this volume) discuss some frustrating—albeit fascinating—experiences that can occur during retrieval: tip-of-the-tongue experiences, déjà vu, and blank-in-the-mind states. They argue that these pesky problems of retrieval produce conflict that compels us to resolve it through metacognitive control. Retrospective confidence judgments are arguably the most widely investigated metamemory judgments, and several chapters explore their bases and their role in decision making. Tidwell, Buttaccio, Chrabaszcz, and Dougherty (this volume) argue that many phenomena involving confidence judgments—such as overconfidence and the hard-easy effect—are best understood by acknowledging the dependence of the judgments on the output of the memory system. Taking an approach that focuses on people's subjective experience when making judgments, Koriat and Adiv (this volume) explain how the relation between confidence and consensuality can account for inaccuracy in people's confidence judgments. People's confidence in answers is related to the normative consensus about the answers, so when consensus points toward an incorrect answer, people's confidence judgments will be inaccurate. Perhaps most important, the mathematical models of confidence judgments of Tidwell et al. and the self-consistency model of Koriat and Adiv provide testable predictions that can guide future explorations of the accuracy of confidence judgments. The final chapter in this section considers a different kind of monitoring judgment relevant to the retrieval process—namely, judgments of source. For this judgment, the focus is not on whether a particular memory is correct, but instead on the source of the memory, such as whether you heard a particular claim from a doctor or from your mother, or whether the origin of a particular memory was an internal source (imagining) or an external source. Kuhlmann and Bayen (this volume) discuss the major challenge for measuring source memory (i.e., disentangling it from content memory and guessing) and explain how multinomial modeling can be used to overcome it. Besides their in-depth review of current evidence and theory of source monitoring, they point to a fascinating new avenue for research that pertains to people's ability to predict their source monitoring.

Metamemory Monitoring: Special Issues

Most of the chapters in the Special Issues section consider questions and issues relevant to metacognitive monitoring, and some highlight the relevance of classical judgments to applied domains. For instance, Hollins and Weber (this volume) discuss how accurate witnesses are when they judge their confidence in their memory for a criminal. Importantly, they highlight subtle differences between the methods used to study classical judgments in the laboratory and the methods used to study eyewitness accuracy, and they then provide an in-depth analysis of the literature on monitoring and control processes relevant to eyewitnesses. Soderstrom, Yue, and Bjork (this volume) consider metamemory in the classroom and emphasize the unfortunate conclusion that metamemory judgments often reflect students' illusions about learning in a way that can lead them to underperform. Importantly, they describe some procedures that show promise for helping to improve the sophistication of students' metamemory. Smith (this volume) puts a new twist on metamemory predictions by developing a framework for research on metaintentions, which pertain to people's metamemory for their prospective memory. This new framework has important applied and theoretical implications that we suspect will foster innovative research programs on metaintentions. Much of the basic research on metamemory has been conducted sans emotion—cold, hard science about judgment and decision making relevant to memory. Fortunately, Efklides (this volume) is turning up the heat by arguing for an important synergy between metamemory and emotion, in which metamemory experiences give rise to feelings like curiosity and surprise. She outlines the potential interrelations between metamemory and affect that can be used to develop research questions within this important domain of metamemory. Whether these emotional responses also pertain to nonhuman metamemory is an open issue, but then again, so is the status of nonhuman metamemory: Do dogs, dolphins, and dormice have metamemory? Can they predict what they know and don't know? The fascinating issue of whether nonhuman animals have metamemory is explored by Washburn, Beran, and Smith (this volume). They provide compelling evidence from psychophysical testing paradigms, information-seeking paradigms, and computerized testing paradigms (among others) that suggests that some nonhuman animals have metacognitive abilities. Finally, in their chapter on hindsight bias, Bernstein, Aßfalg, Kumar, and Ackerman (this volume) look backward historically and also look forward to three new avenues for research while exploring links between the hindsight bias and metamemory. The contribution of confidence accuracy (or the lack thereof) to producing this bias is provocative, and their perspective on metamemory and the hindsight bias is groundbreaking, providing a fresh agenda for research on this historically important and widely explored phenomenon.

Control of Memory

The fourth section of the handbook explores the control of memory, and it includes several chapters that provide comprehensive reviews of study choices, strategies, and self-regulated learning. The chapters have one important commonality: each was written by experts who have also investigated metamemory monitoring. Thus, their theoretical perspectives on control are informed—implicitly if not explicitly—by the reciprocal nature of monitoring and control processes (Figure 1.3). Fiechter, Benjamin, and Unsworth (this volume) provide a great beginning for this section, because they discuss the importance of strategic control at both encoding and retrieval, whereas the other chapters largely focus on either encoding or retrieval. They argue that understanding individual differences in strategic control is critical, because how people control their memory influences how well they perform, and it is as important as (and perhaps more important than) any hardwired differences in memory capacity or the integrity of the memory system. This perspective, along with discovering effective strategies, sets an important agenda for future work in this area. Several chapters delve more deeply into the control of encoding. Kornell and Finn (this volume) offer a useful taxonomy of the regulation of study that separates large-scale decisions, such as when to study, from small-scale decisions, such as how long one studies after deciding to begin. Their emphasis on improving both kinds of decision leads them to consider how to improve the accuracy of the monitoring that is used to control encoding. Sahakyan and Foster (this volume) also uncover the far-reaching power of self-regulation in discussing a new perspective on the strategic control of forgetting. They coin the term metaforgetting and, just as important, they explore how metamemory contributes to—and may be largely responsible for—directed-forgetting effects. They convincingly argue that theories of metamemory need to account for people's intentional forgetting, and at the same time they demonstrate that a complete understanding of forgetting will need to incorporate metamemory processes. Goldsmith (this volume) and Gallo and Lampinen (this volume) consider different aspects of how people control retrieval. Both chapters go beyond a standard review of the literature and offer new perspectives on control theory that will move this field a large step forward. Goldsmith provides a thorough synthesis of control processes with careful consideration of the quality of responses. He uses a production-line metaphor to illustrate the "back-end" and "front-end" processes by which people can strategically reduce errors when retrieving from and reporting from long-term memory. Gallo and Lampinen put a magnifying glass on false memories and consider three retrieval processes—selective search, evaluation, and corroboration—that can help people determine whether an event actually happened in their past. Their interactive framework of metamemory-controlled false memories provides exciting avenues for future research that are relevant to further refining the theory of false memories, as well as to identifying conditions that reduce them.

Neurocognition of Metamemory

The fifth section of the handbook is centered on describing and understanding the neurocognition of metamemory. These chapters are focused on establishing the neurological bases of metamemory in general (Metcalfe & Schwartz, this volume) and of source monitoring in particular (Mitchell, this volume; and for further insight into source monitoring, check out Kuhlmann & Bayen, this volume). Metcalfe and Schwartz (this volume) discuss the challenges of identifying the neurological underpinnings of metamemory processes, with perhaps the largest challenge arising from the level of complexity inherent in these processes. They also identify two aspects of metamemory processes that can be used to guide research and theory on the neurocognition of metamemory: the first is that metacognitions are conscious experiences, and the second is that they are self-referential. Mitchell (this volume) provides historical context for, and current findings on, the neurological roots of source monitoring (e.g., remembering whether your best friend or your husband told you information about an upcoming event). She highlights several neurological mechanisms that are critical for understanding the neurocognition of source monitoring. Other chapters explore clinical populations and explain how they inform us about metamemory processes. Izaute and Bacon (this volume) consider metamemory capabilities in populations with psychopathology, and Ernst, Moulin, Souchay, Mograbi, and Morris (this volume) consider the metamemory capabilities of people with anosognosia and Alzheimer's disease. Izaute and Bacon (this volume) discuss the function of metamemory processes in several clinical populations, including those with depression, obsessive compulsive disorder (OCD), and autism. They note that metamemory function in these populations is not yet well understood, so increased research with these groups will be critical for the future. They also provide an in-depth view of metamemory function in schizophrenia populations, which have been more heavily investigated.


Izaute and Bacon recommend that the next wave of research should focus on helping different clinical populations harness intact metamemory processes toward correcting any deficits in their cognitive processes. Ernst et al. (this volume) provide a theoretical framework to unify research on metamemory and anosognosia, and they use this perspective to evaluate metamemory function in populations diagnosed with Alzheimer's disease (AD). They provide an excellent overview of metamemory in AD populations, including issues related to overconfidence, awareness, sensitivity, fractionation, neural substrates, and metacognitive control. They also introduce "implicit awareness" as an exciting new direction for increasing our understanding of metamemory function in AD populations.

Development of Metamemory

The last section of the handbook introduces research and theory on developmental issues in metamemory. Schneider and Löffler (this volume) provide a detailed account of the rich history of research on the development of metamemory from childhood through adolescence. For instance, they discuss research on the development of self-reflection about mental states and on theory of mind, and they argue that the development of these initial capacities facilitates subsequent metacognitive knowledge. They highlight issues that are critical for understanding the development of metamemory processes, as well as for understanding metamemory processes more generally. Two chapters consider how aging in adulthood influences metamemory, with one examining monitoring (Castel, Middlebrooks, & McGillivray, this volume) and the other control (Hertzog, this volume). Castel et al. (this volume) synthesize the literature by arguing that older adults (typically 65+ years old) have impaired monitoring accuracy in some contexts and spared monitoring accuracy in others. For instance, they provide evidence leading to the conclusion that relative monitoring of learning remains largely intact with age but that overconfidence at test represents a common age-related impairment. Finally, Hertzog (this volume) explores how aging influences metamemory control processes, both at encoding and at retrieval. As with the literature on monitoring and aging, Hertzog describes how some control processes appear to be disrupted by aging, whereas others are preserved. A major theme of his chapter is how older adults may be able to rely on intact metamemory processes to improve their memory.

User’s Guide to the Handbook

As previously mentioned, we organized the handbook using the Nelson and Narens (1990) framework as a guide. However, the chapters in this handbook can be approached in other ways as well. You can engage them by selecting particular sections of interest or by simply starting at the beginning and reading through to the end. We also provide another alternative in Table 1.3, in which we have reorganized the chapters with respect to four key issues: (1) overviews of each area of research, (2) new perspectives on existing areas of inquiry, (3) innovative areas of research in metamemory, and (4) metamemory in applied contexts. The overview chapters synthesize the literature in each field and discuss research, theory, and motivations for future research. The application chapters focus on metamemory in educational contexts (e.g., students) and legal contexts (e.g., eyewitnesses). As such, these chapters highlight the implications of metamemory theory and research for these contexts, and our hope is that they will inspire further research on applied metamemory. Some chapters also offer a novel approach or theory for the field; these chapters provide a new spin on an existing issue, and we suspect that these new perspectives will influence the direction of research in each respective field. Finally, the frontiers chapters introduce new areas of metamemory research (e.g., metaintentions, metaforgetting). Of course, every chapter in the handbook offers new frontiers, because each includes future directions, but these chapters in particular describe understudied areas of research.

Closing Remarks

In her treatise on the history of philosophy, Calkins lamented the difficulty of deciding where to begin. We sympathize, but at the end of this chapter, we must admit that we are more excited than ever about the future of metamemory research, partly because past research has laid key foundations for its epistemology. We now have sophisticated methods and analyses to explore metamemory processes—the methods continue to be adapted, and the analyses are questioned at times; nevertheless, they have generated robust evidence and generalizations about metamemory that have supported a deeper understanding of metamemory knowledge, monitoring, and control. In the present chapters, the authors discuss the current evidence and explanations relevant to their domain of interest, which now represent a history of sorts—"a more or less arbitrary point in time" (as per Calkins, 1910).


Table 1.3 Alternative organization of the chapters in the Oxford Handbook of Metamemory.

Overview Chapters
Ch 1. A Brief History of Metamemory Research and Handbook Overview
Ch 2. Methodology for Investigating Human Metamemory: Problems and Pitfalls
Ch 4. Judgments of Learning: Methods, Data, and Theory
Ch 5. Introspecting on the Elusive: The Uncanny State of the Feeling of Knowing
Ch 7. Sources of Bias in Judgment and Decision Making
Ch 9. Metacognitive Aspects of Source Monitoring
Ch 14. Metamemory in Comparative Context
Ch 22. The Cognitive Neuroscience of Source Monitoring
Ch 24. Metamemory in Psychopathology
Ch 25. The Development of Metacognitive Knowledge in Children and Adolescents
Ch 26. Monitoring Memory in Old Age: Impaired, Spared, and Aware
Ch 27. Aging and Metacognitive Control

Application in the Classroom and in the Court Room
Ch 10. Monitoring and Regulation of Accuracy in Eyewitness Memory: Time to Get Some Control
Ch 11. Metamemory and Education
Ch 16. The Metacognitive Foundations of Effective Remembering
Ch 17. Self-Regulated Learning: An Overview of Theory and Data
Ch 20. Three Pillars of False Memory Prevention: Orientation, Evaluation, and Corroboration

Cutting Edge Perspectives
Ch 3. Internal Mapping and Its Impact on Measures of Absolute and Relative Metacognitive Accuracy
Ch 8. The Self-Consistency Theory of Subjective Confidence
Ch 15. Looking Back and Forward on Hindsight Bias
Ch 19. Metacognitive Quality-Control Processes in Memory Retrieval and Reporting
Ch 21. The Ghost in the Machine: Self-Reflective Consciousness and the Neuroscience of Metacognition
Ch 23. Anosognosia and Metacognition in Alzheimer's Disease: Insights from Experimental Psychology

New Frontiers in Metamemory Research
Ch 6. Tip-of-the-Tongue States, Déjà Vu Experiences, and Other Odd Metacognitive Experiences
Ch 12. Prospective Memory: A Framework for Research on Metaintentions
Ch 13. Metamemory and Affect
Ch 18. The Need for Metaforgetting: Insights from Directed Forgetting

By also sharing ideas on exciting new directions for research, we suspect that the chapters will inspire others to explore metamemory, whether that involves resolving its mysteries or discovering how to enhance its function toward improving people's learning, memory, and retrieval.

Note

1. Flavell first used the term at the 1971 meeting of the Society for Research in Child Development in Minneapolis, and he subsequently published the term in his 1971 paper. We thank Henry Wellman for sharing the origins of the term as we prepared this chapter.

References

Arbuckle, T. Y., & Cuddy, L. L. (1969). Discrimination of item strength at time of presentation. Journal of Experimental Psychology, 81, 126–131. doi:10.1037/h0027455

Bernstein, D. M., Aßfalg, A., Kumar, R., & Ackerman, R. (2015, this volume). Looking backward and forward on hindsight bias. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Borkowski, J. G., Peck, V. A., Reid, M. K., & Kurtz, B. E. (1983). Impulsivity and strategy transfer: Metamemory as mediator. Child Development, 54, 459–473. doi:10.1111/j.1467-8624.1983.tb03888.x

Brown, A. L. (1975). The development of memory: Knowing, knowing about knowing, and knowing how to know. In H. W. Reese (Ed.), Advances in child development and behavior (Vol. 10, pp. 103–152). New York, NY: Academic Press.

Brown, R., & McNeill, D. (1966). The "tip of the tongue" phenomenon. Journal of Verbal Learning and Verbal Behavior, 5, 325–337. doi:10.1016/S0022-5371(66)80040-3

Bruce, P. R., Coyne, A. C., & Botwinick, J. (1982). Adult age differences in metamemory. The Journal of Gerontology, 37, 354–357. doi:10.1093/geronj/37.3.354

Calkins, M. W. (1910). The persistent problems of philosophy: An introduction to metaphysics through the study of modern systems (2nd ed.). New York, NY: The Macmillan Company.

Castel, A. D., Middlebrooks, C. D., & McGillivray, S. (2015, this volume). Monitoring memory in old age: Impaired, spared, and aware. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Dixon, R. A. (1989). Questionnaire research on metamemory and aging: Issues of structure and function. In L. W. Poon, D. C. Rubin, & B. A. Wilson (Eds.), Everyday cognition in adulthood and late life (pp. 394–415). New York, NY: Cambridge University Press.

Dixon, R. A., Hultsch, D. F., & Hertzog, C. (1988). The metamemory in adulthood (MIA) questionnaire. Psychopharmacology Bulletin, 24, 671–688.

Dunlosky, J., & Metcalfe, J. (2009). Metacognition. Thousand Oaks, CA: Sage Publications, Inc.

Dunlosky, J., Mueller, M. L., & Thiede, K. W. (2015, this volume). Methodology for investigating human metamemory: Problems and pitfalls. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Efklides, A. (2015, this volume). Metamemory and affect. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Einstein, G. O., & McDaniel, M. A. (1990). Normal aging and prospective memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16, 717–729. doi:10.1037/0278-7393.16.4.717

Ernst, A., Moulin, C. J. A., Souchay, C., Mograbi, D. C., & Morris, R. (2015, this volume). Anosognosia and metacognition in Alzheimer's disease: Insights from experimental psychology. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Fiechter, J. L., Benjamin, A. S., & Unsworth, N. (2015, this volume). The metacognitive foundations of effective remembering. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Fischhoff, B. (1975). Hindsight ≠ foresight: The effect of outcome knowledge on judgment under uncertainty. Journal of Experimental Psychology: Human Perception and Performance, 1, 288–299. doi:10.1037/0096-1523.1.3.288

Fischhoff, B., Slovic, P., & Lichtenstein, S. (1977). Knowing with certainty: The appropriateness of extreme confidence. Journal of Experimental Psychology: Human Perception and Performance, 3, 552–564. doi:10.1037/0096-1523.3.4.552

Flavell, J. H. (1971). First discussant's comments: What is memory development the development of? Human Development, 14, 272–278. doi:10.1159/000271221

Flavell, J. H. (1979). Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry. American Psychologist, 34, 906–911. doi:10.1037/0003-066X.34.10.906

Flavell, J. H., & Wellman, H. M. (1977). Metamemory. In R. V. Kail & J. W. Hagen (Eds.), Perspectives on the development of memory and cognition (pp. 3–34). Hillsdale, NJ: Erlbaum.

Gallo, D. A., & Lampinen, J. M. (2015, this volume). Three pillars of false memory prevention: Orientation, evaluation, and corroboration. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Gilewski, M. J., Zelinski, E. M., & Schaie, K. W. (1990). The memory functioning questionnaire for assessment of memory complaints in adulthood and old age. Psychology and Aging, 5, 482–490.

Goldsmith, M. (2015, this volume). Metacognitive quality-control processes in memory retrieval and reporting. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Hart, J. T. (1965). Memory and the feeling-of-knowing experience. Journal of Educational Psychology, 56, 208–216. doi:10.1037/h0022263

Hart, J. T. (1966). Methodological note on feeling-of-knowing experiments. Journal of Educational Psychology, 57, 347–349. doi:10.1037/h0023915

Hart, J. T. (1967a). Memory and the memory-monitoring process. Journal of Verbal Learning and Verbal Behavior, 6, 685–691. doi:10.1016/S0022-5371(67)80072-0

Hart, J. T. (1967b). Second-try recall, recognition, and the memory-monitoring process. Journal of Educational Psychology, 58, 193–197. doi:10.1037/h0024908

Hertzog, C. (2015, this volume). Aging and metacognitive control. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Hertzog, C., & Hultsch, D. F. (2000). Metacognition in adulthood and old age. In F. I. M. Craik & T. A. Salthouse (Eds.), The handbook of aging and cognition (2nd ed., pp. 417–466). Mahwah, NJ: Lawrence Erlbaum Associates.

Higham, P. A., Zawadzka, K., & Hanczakowski, M. (2015, this volume). Internal mapping and its impact on measures of absolute and relative metacognitive accuracy. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Hollins, T. J., & Weber, N. (2015, this volume). Monitoring and regulation of accuracy in eyewitness memory: Time to get some control. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Izaute, M., & Bacon, E. (2015, this volume). Metamemory in psychopathology. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Johnson, M. K., Raye, C. L., Wang, A. Y., & Taylor, T. H. (1979). Fact and fantasy: The roles of accuracy and variability in confusing imaginations with perceptual experiences. Journal of Experimental Psychology: Human Learning and Memory, 5, 229–240. doi:10.1037/0278-7393.5.3.229

Koriat, A. (1993). How do we know that we know? The accessibility model of the feeling of knowing. Psychological Review, 100, 609–639. doi:10.1037/0033-295X.100.4.609

Koriat, A. (1997). Monitoring one's own knowledge during study: A cue utilization approach to judgments of learning. Journal of Experimental Psychology: General, 126, 349–370. doi:10.1037/0096-3445.126.4.349

Koriat, A., & Adiv, S. (2015, this volume). The self-consistency theory of subjective confidence. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Kornell, N., & Finn, B. (2015, this volume). Self-regulated learning: An overview of theory and data. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Kreutzer, M. A., Leonard, C., & Flavell, J. H. (1975). An interview study of children's knowledge about memory. Monographs of the Society for Research in Child Development, 40, 1–60. doi:10.2307/1165955

Kuhlmann, B. G., & Bayen, U. J. (2015, this volume). Metacognitive aspects of source monitoring. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Kurtz, B. E., & Borkowski, J. G. (1987). Development of strategic skills in impulsive and reflective children: A longitudinal study of metacognition. Journal of Experimental Child Psychology, 43, 129–148. doi:10.1016/0022-0965(87)90055-5

Lockl, K., & Schneider, W. (2007). Knowledge about the mind: Links between theory of mind and later metamemory. Child Development, 78, 148–167. doi:10.1111/j.1467-8624.2007.00990.x

Markman, E. M. (1977). Realizing that you don't understand: A preliminary investigation. Child Development, 48, 986–992. doi:10.1111/j.1467-8624.1977.tb01257.x

McGlynn, S. M., & Kaszniak, A. W. (1991). When metacognition fails: Impaired awareness of deficit in Alzheimer's disease. Journal of Cognitive Neuroscience, 3, 183–187. doi:10.1162/jocn.1991.3.2.183

Metcalfe, J., & Schwartz, B. L. (2015, this volume). The ghost in the machine: Self-reflective consciousness and the neuroscience of metacognition. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Mitchell, K. J. (2015, this volume). The cognitive neuroscience of source monitoring. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Munsterberg, H. (1908). On the witness stand: Essays on psychology and crime. New York, NY: Clark Boardman.

Nelson, T. O. (1984). A comparison of current measures of the accuracy of feeling-of-knowing predictions. Psychological Bulletin, 95, 109–133. doi:10.1037/0033-2909.95.1.109

Nelson, T. O. (Ed.). (1992). Metacognition: Core readings. Needham Heights, MA: Allyn & Bacon.

Nelson, T. O., & Narens, L. (1990). Metamemory: A theoretical framework and new findings. In G. H. Bower (Ed.), The psychology of learning and motivation (Vol. 26, pp. 125–173). New York, NY: Academic Press.

Pintrich, P. R., Smith, D. A. F., Garcia, T., & McKeachie, W. J. (1993). Reliability and predictive validity of the motivated strategies for learning questionnaire (MSLQ). Educational and Psychological Measurement, 53, 801–813. doi:10.1177/0013164493053003024

Reder, L. M. (1987). Strategy selection in question answering. Cognitive Psychology, 19, 90–138. doi:10.1016/0010-0285(87)90005-3

Rhodes, M. G. (2015, this volume). Judgments of learning: Methods, data, and theory. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Robinson, D. N. (1989). Aristotle's psychology. New York, NY: Columbia University Press.

Sahakyan, L., & Foster, N. L. (2015, this volume). The need for metaforgetting: Insights from directed forgetting. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Schneider, W., Borkowski, J. G., Kurtz, B. E., & Kerwin, K. (1986). Metamemory and motivation: A comparison of strategy use and performance in German and American children. Journal of Cross-Cultural Psychology, 17, 315–336. doi:10.1177/0022002186017003005

Schneider, W., & Löffler, E. (2015, this volume). The development of metacognitive knowledge in children and adolescents. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Schwartz, B. L., & Cleary, A. M. (2015, this volume). Tip-of-the-tongue states, déjà vu experiences, and other odd metamemory experiences. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Schwartz, B. L., & Metcalfe, J. (1994). Methodological problems and pitfalls in the study of human metacognition. In J. Metcalfe & A. P. Shimamura (Eds.), Metacognition: Knowing about knowing (pp. 93–113). Cambridge, MA: The MIT Press.

Shimamura, A. P., & Squire, L. R. (1986). Memory and metamemory: A study of the feeling-of-knowing phenomenon in amnesic patients. Journal of Experimental Psychology: Learning, Memory, and Cognition, 12, 452–460. doi:10.1037/0278-7393.12.3.452

Smith, J. D., Schull, J., Strote, J., McGee, K., Egnor, R., & Erb, L. (1995). The uncertain response in the bottlenosed dolphin (Tursiops truncatus). Journal of Experimental Psychology: General, 124, 391–408.

Smith, R. E. (2015, this volume). Prospective memory: A framework for research on metaintentions. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Soderstrom, N. C., Yue, C. L., & Bjork, E. L. (2015, this volume). Metamemory and education. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Son, L. K., & Kornell, N. (2008). Research on the allocation of study time: Key studies from 1890 to the present (and beyond). In J. Dunlosky & R. A. Bjork (Eds.), A handbook of memory and metamemory (pp. 333–351). Hillsdale, NJ: Psychology Press.

Thomas, A. K., Lee, M., & Hughes, G. (2015, this volume). Introspecting on the elusive: The uncanny state of the feeling of knowing. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Tidwell, J. J., Buttaccio, D., Chrabaszcz, J. S., Dougherty, M. R., & Thomas, R. P. (2015, this volume). Sources of bias in judgment and decision making. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124–1131. doi:10.1126/science.185.4157.1124

Washburn, D. A., Beran, M. J., & Smith, J. D. (2015, this volume). Metamemory in comparative context. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Wellman, H. M. (1977). Preschoolers' understanding of memory-relevant variables. Child Development, 48, 1720–1723. doi:10.2307/1128544

Wellman, H. M. (1978). Knowledge of the interaction of memory variables: A developmental study of metamemory. Developmental Psychology, 14, 24–29. doi:10.1037/0012-1649.14.1.24

Yates, F. A. (1997). The art of memory. London, England: Pimlico.

Zacks, R. T. (1969). Invariance of total learning time under different conditions of practice. Journal of Experimental Psychology, 82, 441–447. doi:10.1037/h0028369

Zelinski, E. M., Gilewski, M. J., & Thompson, L. W. (1980). Do laboratory tests relate to self-assessment of memory ability in the young and old? In L. W. Poon, J. L. Fozard, L. S. Cermak, D. Arenberg, & L. W. Thompson (Eds.), New directions in memory and aging: Proceedings of the George A. Talland Memorial Conference (pp. 519–544). Hillsdale, NJ: Erlbaum.


CHAPTER 2

Methodology for Investigating Human Metamemory: Problems and Pitfalls

John Dunlosky, Michael L. Mueller, and Keith W. Thiede

Abstract

Research on metamemory focuses on a core set of issues that pertain to people's beliefs about memory, their monitoring of memory, and their control of memory. To address these issues, researchers have used variants of a small set of methods, which often involve using standard memory methods and then having participants make judgments about their memory or control different phases of learning. Despite the overlap of methods with standard memory research, metamemory research poses some unique problems and pitfalls that can make interpretation of results tricky. The present chapter overviews the core issues addressed by the majority of metamemory research and describes the general methods typically used to address them. Most important, it highlights some of the problems and pitfalls of metamemory research and offers some suggestions on how to solve or sidestep them.

Key Words: metamemory, methodology, judgment accuracy, bias, resolution

In 1994, Schwartz and Metcalfe published the first chapter on methods for investigating metamemory, and our title pays homage to theirs: "Methodological Problems and Pitfalls in the Study of Human Metacognition." Given that the same scientific methods are used to investigate metamemory as in any other domain of memory, why would the study of metamemory require a methodological approach that is any different from those covered in a basic course on scientific method? Of course, the standard methodological issues do apply here, and admittedly, the particular problems and pitfalls that arise in metamemory research are not necessarily distinct from core issues of rigorous scientific methodology (e.g., validity, reliability, and measurement). Nevertheless, the implications of some of these methodological issues for metamemory research and the interpretation of metamemory data can be subtle. For instance, differences in judgment accuracy may occur between two groups of participants, and one (somewhat reflexive) conclusion is that the groups differ in their ability to monitor memory. However, such differences in judgment accuracy do not necessitate this conclusion and often do not arise from true differences in monitoring ability. In the present chapter, we touch on some of the problems and pitfalls of conducting metamemory research in a way that highlights and extends the original chapter by Schwartz and Metcalfe (1994). Toward this end, we first describe some of the general questions that are often addressed in metamemory research, and then we discuss approaches to answering them and potential problems that arise in rigorously doing so.

Core Questions

Although the literature on metamemory and metacognition has grown dramatically over the past 40 years, much of the research focuses on a set of key questions. Eight such questions are listed in Table 2.1. These questions are interrelated, and the main issues pertain to people's beliefs (or knowledge) about memory (question 1, Table 2.1), how they monitor their memory (questions 2–4), and how they control it (questions 5 and 6). The final two questions in Table 2.1 represent cross-cutting areas where many of the prior questions (1–6) have been systematically investigated with respect to how metamemory changes across the lifespan (question 7) or the neurological mechanisms that underlie metamemory processes (question 8).

Table 2.1 Some Questions That Are Central to Metamemory Research

1. What beliefs do people have about how memory operates?
2. How do people monitor memory?
3. How accurate is memory monitoring?
4. Can monitoring accuracy be improved?
5. How is monitoring used to control study and retrieval?
6. How do people regulate their learning and retrieval?
7. How does metamemory change across the lifespan?
8. What neurological mechanisms influence metamemory?

To put the core questions in context, consider a native English speaker who is studying for an upcoming exam on French vocabulary, for which she needs to learn to produce each French word in response to the corresponding English one. This production task is straightforward (e.g., learn to respond "maison" when one hears or sees "house"), but it can be challenging, and when she studies for the exam and later takes it, many aspects of metamemory can contribute to her success. When studying each French-English pair, she may make a judgment about how well she will remember it later on, and if she believes that the pair will be remembered (a monitoring process), she may also decide not to study it again (a control process). Later on during the test, she may also judge her confidence in her responses, so when she quickly responds "maison" to "house," she may think, "I've got it." For responses that she is less confident in (a monitoring process), she may spend more time trying to retrieve another possible response from memory or decide to leave the pair for a while and return to it later (control processes). Even her knowledge about memory can play a critical role. For instance, if like most students she believes that cramming for an exam (i.e., studying only the night before) is the best method (Taraban, Maki, & Rynearson, 1999), then she may cram and miss the opportunity to use spaced practice, which is a much more effective technique for promoting long-term retention (Cepeda, Pashler, Vul, Wixted, & Rohrer, 2006; Dunlosky, Rawson, Marsh, Nathan, & Willingham, 2013).

This scenario can be used to illustrate the importance of having accurate knowledge about memory, accurately monitoring memory, and effectively controlling it. For instance, if the student is overconfident and often judges that she will remember correct responses when in fact she will not, then she may not return to restudy these pairs and hence will underachieve (for evidence, see Dunlosky & Rawson, 2012). Even if she is very good at judging which pairs she has learned well versus those she has not, if she does not use her accurate judgments to appropriately control learning, she may not learn efficiently and may still underachieve. For instance, she may decide to focus on those pairs that she has judged as not well learned, which is an effective control strategy that most students use (Metcalfe & Kornell, 2005). However, she may decide just to repeat the less well-known pairs to herself one more time, which is a relatively ineffective rote-memorization strategy that would have little influence on retention. By contrast, if she used an effective strategy—testing herself with feedback—and did so until she had recalled each of these pairs multiple times, then she would greatly increase her chances of remembering them (for details on this successive-relearning technique, see Rawson & Dunlosky, 2014).

By highlighting how good metamemory can lead to good learning and memory, this simple scenario also provides an overview of some of the most highly investigated questions about metamemory that are showcased in Table 2.1. As noted earlier, these questions pertain to people's knowledge about memory, how they monitor memory, and how they control memory. Most of the chapters in this volume focus on one of these aspects of metamemory, with the central differences across chapters pertaining to which learning phase they emphasize—study, retention, or retrieval. Researchers have systematically investigated people's monitoring and control of each phase of learning. With respect to monitoring, some researchers focus on how people monitor their learning during study (e.g., Rhodes, this volume), whereas others focus on how people monitor their retrieval (e.g., Thomas, Lee, & Hughes, this volume; Tidwell, Buttaccio, Chrabaszcz, & Dougherty, this volume). Similarly for control, some researchers focus on how people control their study (e.g., Kornell & Finn, this volume), whereas others focus on how they control their retrieval (e.g., Gallo & Lampinen, this volume; Hollins & Weber, this volume). Understanding how these metamemory components (monitoring and control) cross with the phases of learning is useful for developing a broader view of the entire metamemory field. Toward this end, Nelson and Narens (1990) unified the field of metamemory by introducing a framework that illustrated this orthogonal relationship between the learning phases and metamemory processes. An adapted version of this often-reprinted framework is presented in Figure 2.1 (from Dunlosky, Serra, & Baker, 2007), which focuses on monitoring and control processes in particular. We suspect that one reason this framework has had such a major impact is that any researcher who was exploring a particular question (e.g., How do older adults monitor their learning?) could place that work within the context of a broader research framework, and it inspired many to begin exploring new questions that were linked to their ongoing research (e.g., moving from examining whether aging in adulthood influences monitoring toward understanding how older adults control their learning). Moreover, the framework provides an excellent vehicle for synthesizing existing literatures so as to identify gaps in them. Excellent examples of such syntheses can be found in the current volume, such as the exploration of metamemory development by Schneider and Löffler (this volume) or the treatise on the neurocognition of metamemory by Metcalfe and Schwartz (this volume). Perhaps most important for present purposes, the Nelson-and-Narens framework offers a great bird's-eye view of most of the metamemory field, and hence it can be used to organize the core questions (Table 2.1) that drive metamemory research.

Major Data Collection Methods

Despite the numerous questions and issues that are being explored in the field (Table 2.1), standard (and easy-to-use) methods to collect metamemory data—and even to analyze it—can be used to answer many of these questions. In general, one could begin with a method for collecting memory data: if you are interested in how people monitor their ongoing study, then begin by using a method in which people study some to-be-learned items (e.g., paired associates, sentences, or text), whereas if you are interested in retrieval monitoring, then begin by using a method that involves retrieving content that had already been studied. On top of such basic methods, one simply includes a monitoring judgment (top half of Figure 2.1) relevant to the particular phase of learning that is being investigated (for definitions of the monitoring judgments, see Table 2.2). For instance, for monitoring of study, participants could make a judgment of learning (JOL) immediately after they study each item, whereas for retrieval monitoring, participants could be asked to make a feeling-of-knowing (FOK) judgment for responses they cannot recall or to make a retrospective confidence (RC) judgment about their answers on a test. In either case, researchers often measure criterion performance, so that the accuracy of the focal judgment can be estimated by comparing it to actual performance (as per Hart, 1965).

Figure 2.1 Adapted from Nelson and Narens's (1990) metamemory framework. The figure arrays monitoring judgments (ease-of-learning judgments, judgments of learning, source-monitoring judgments, feeling-of-knowing judgments, and confidence in retrieved answers) and control processes (selection of kind of processing, item selection, termination of study, retrieval practice, self-directed search, selection of search strategy, output of response, and termination of search) across the acquisition, retention, and retrieval phases of learning. Reprinted from Dunlosky, Serra, and Baker (2007), with permission from John Wiley & Sons.

Table 2.2 Names and Definitions of Common Metamemory Judgments and Control Processes

Metamemory Judgments
Ease of learning (EOL): Judgments of how easy to-be-studied items will be to learn.
Judgments of learning (JOLs): Judgments of the likelihood of remembering recently studied items on an upcoming test.
Feeling of knowing (FOK): Judgments of the likelihood of recognizing currently unrecallable answers on an upcoming test.
Source monitoring: Judgments made during a criterion test pertaining to the source of a particular memory.
Confidence in answers: Judgments of the likelihood that a response on a test is correct, often referred to as retrospective confidence (RC) judgments.

Control Processes
Strategy selection: Selection of strategies to employ when attempting to commit an item to memory.
Item selection: Decision about whether to study an item on an upcoming trial.
Termination of study: Decision to stop studying an item currently being studied.
Selection of search strategy: Selection of a particular strategy in order to produce a correct response during a test.
Termination of search: Decision to terminate searching for a response.

Investigating control processes is often a bit more complicated because it involves (a) allowing participants to control some aspect of learning (e.g., strategy use, study, or retrieval) and (b) measuring the aspect of control relevant to the particular phase of learning. Thus, for instance, instead of presenting items for study at an experimenter-paced rate (the standard method for memory research), researchers interested in the control of study either allow people to study each item as long as they want (and measure study time) or allow people to choose which items they will study on an upcoming trial (and measure which items were selected). For retrieval, researchers may measure how much time people spend trying to retrieve an answer or measure whether people withhold or output a particular response. Given that many general theories of self-regulation propose that monitoring is used in control (for review and other factors that influence control, see Dunlosky & Ariel, 2011; Kornell & Finn, this volume), control is also investigated by comparing a person's monitoring judgments during a learning phase with subsequent control decisions. For instance, JOLs made for items during study may be compared to later self-paced restudy of the same items (for reviews, see Dunlosky & Ariel, 2011; Son & Metcalfe, 2000).

Beyond monitoring and control, researchers are often interested in people's beliefs about memory or in the strategies they use during study or retrieval. For the former, questionnaires have been developed to assess people's beliefs about how memory operates; these have predominately been used to explore developmental changes in beliefs in childhood or later adulthood (e.g., Hertzog, this volume; Schneider & Löffler, this volume). For strategy use, people may be asked to think aloud while performing a memory task to assess which strategies they are using while studying a set of materials; or more focused, specific reports may be collected that assess which strategy a person is using while studying each item on a list (e.g., Richardson, 1998). Surveys to assess people's strategy use—often in school—have also been popular and have consistently demonstrated that students report using a mix of both effective and ineffective strategies while they study (for a brief review, see Bjork, Dunlosky, & Kornell, 2013). Just like monitoring judgments, the validity (or accuracy) of these reports has been established in some contexts, yet it is still under scrutiny. Nevertheless, the main point here is that just a handful of simple-to-collect judgments, reports, or measures of people's monitoring and control processes can be used to investigate almost every aspect of metamemory.

It turns out that many of the questions presented in Table 2.1 received initial (and relatively definitive) answers when they were first considered by researchers in the 1960s to the 1980s (Tauber & Dunlosky, this volume). What we believe has sustained the growing field of metamemory is that creative variations of the same methods can be used to discover answers to more detailed versions of the general questions in Table 2.1. For instance, cutting-edge studies on JOLs conducted 15 years ago were largely demonstrations of which factors influence the judgments, such as establishing that pair relatedness (salt–pepper vs. dog–spoon) influences the judgments. Fifteen years later, modifications of the same methods are being used to explore the specific processes that mediate the effects of pair relatedness on the judgments. We expect that new questions will drive metamemory research in the next 20 years, yet variations of the same methods will still be the mainstay of work in this area. Accordingly, it is useful to consider some perspectives and pitfalls that arise when researchers are analyzing and interpreting data from metamemory research.
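To make the shared logic of these methods concrete, here is a minimal sketch, in Python, of the standard study-JOL-test procedure described above. This is our illustration rather than code from the chapter: the function names and the present/collect_* callbacks are hypothetical stand-ins for whatever stimulus-presentation software a lab actually uses.

```python
# A minimal sketch of a study-JOL-test procedure. Each record pairs a
# monitoring judgment with its criterion outcome, which is exactly the
# data structure that judgment-accuracy analyses operate on.

def run_study_jol_test(pairs, present, collect_jol, collect_recall):
    """Run one study-judgment-test trial; return one record per item."""
    records = []
    for cue, target in pairs:
        present(cue, target)          # study phase: show the full pair
        jol = collect_jol(cue)        # immediate JOL, e.g., on a 0-100% scale
        records.append({"cue": cue, "target": target, "jol": jol})
    for record in records:            # test phase: present the cue only
        response = collect_recall(record["cue"])
        record["correct"] = int(response == record["target"])
    return records
```

Because every item carries both a judgment and an outcome, the same records can be scored for the kinds of judgment accuracy discussed in the next section.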

Data Analyses and Interpretation: Perspectives and Pitfalls

In the current section, we touch on only a subset of the analyses that are conducted on metamemory data and just a few of the pitfalls that one should consider when interpreting outcomes from those analyses. Some of the pitfalls we cover here are also discussed in other chapters within this volume, and those chapters highlight further problems that occur when one is investigating a particular kind of judgment or control process. For instance, in discussing source-monitoring judgments, Kuhlmann and Bayen (this volume) describe how measures of source-monitoring accuracy can be influenced by memory for the targets; thus, if researchers do not use measures of source accuracy that are corrected for target memory, the effect of a given variable on a measure of source accuracy cannot be unambiguously attributed to its influence on source memory versus target memory. Higham, Zawadzka, and Hanczakowski (this volume) provide an in-depth analysis of how understanding the mapping functions from internal states to overt judgments is important for interpreting calibration (or bias) scores relevant to judgment accuracy. Our aim here is simply to touch on some issues that are rather general and could be relevant to many kinds of metamemory processes, and we encourage investigators to identify the more specific—but equally important—perspectives and pitfalls that are relevant to their literature of interest.

Questions about Judgment Bases and Judgment Accuracy

To put our present discussion of metamemory judgments in context, consider a framework, presented in Figure 2.2, which was inspired by Brunswik's lens model and reflects modern inference-based approaches to metamemory judgments (e.g., Johnson & Raye, 1981; Koriat, 1993, 1997; Schwartz, Benjamin, & Bjork, 1997; Serra & Metcalfe, 2009). Inference-based approaches claim that people make judgments by inferring how potential cues are related to criterion performance. A cue here can be almost any stimulus—internally generated or external to the learner—that varies across to-be-judged items. For instance, when font size (a "potential cue" in Figure 2.2) is manipulated at study (e.g., some words are in a large font size and others in a smaller font size), people's JOLs are greater for words presented in the larger font size (Rhodes & Castel, 2008; and see Figure 4.1 of Rhodes, this volume). In this case, people presumably do not have direct access to how strongly a particular to-be-remembered target is stored in memory (for eloquent demonstrations, see Koriat, 1993), but people infer whether (and how strongly) the cue relates to criterion performance and adjust their judgments accordingly. In regard to Figure 2.2, judgment accuracy is a relation between the judgments and criterion performance and is often the construct of focal interest for modern research on metamemory judgments. Importantly, this framework indicates that judgment accuracy will be a function of (at least) two other relationships: (a) the degree to which the judgments are based on a potential cue (referred to as cue utilization) and (b) the degree to which a potential cue relates to performance (referred to as cue diagnosticity). In the example, the cue of font size showed some cue utilization, because judgments were higher for larger than smaller font-sized words; however, this cue had no diagnosticity, because it was not related to criterion performance. Thus, using this cue to inform judgments would not improve the accuracy of the judgments, and it may even limit their accuracy. Finally, it is important to note that the framework in Figure 2.2 is largely empirical—that is, it simply describes a set of empirical relations among measurable outcomes (one or more cues, judgments, and criterion performance). With this in mind, however, the framework provides a useful tool to investigate metamemory judgments, and it will help us provide a deeper understanding of some of the pitfalls that are covered next.

Figure 2.2 Relations among cues, judgments, and criterion performance that define cue utilization (the cue-judgment relation), cue diagnosticity (the cue-performance relation), and judgment accuracy (the judgment-performance relation, as pertains to either measures of relative or absolute accuracy).
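To illustrate how the three relations in Figure 2.2 can be estimated in practice, consider the following sketch. It is our illustration, not an analysis from the chapter, and the data are simulated to mimic the font-size example: the cue drives the judgments (utilization) but is unrelated to recall (no diagnosticity).

```python
# Estimate the three relations in Figure 2.2 as simple correlations
# across items, using simulated data that mimic the font-size example.
import numpy as np

rng = np.random.default_rng(0)
n_items = 200
font_is_large = rng.integers(0, 2, n_items)                  # potential cue (0/1)
jols = 50 + 15 * font_is_large + rng.normal(0, 10, n_items)  # judgments track the cue
recall = rng.integers(0, 2, n_items)                         # performance ignores the cue

def corr(x, y):
    return float(np.corrcoef(x, y)[0, 1])

print("cue utilization (cue, judgment):          ", round(corr(font_is_large, jols), 2))
print("cue diagnosticity (cue, performance):     ", round(corr(font_is_large, recall), 2))
print("judgment accuracy (judgment, performance):", round(corr(jols, recall), 2))
```

Run on these simulated data, the first correlation comes out strongly positive while the other two hover near zero, reproducing the pattern in the font-size example: a cue that is used but not diagnostic cannot support accurate judgments.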

Differences (or a lack of them) in judgment magnitude.

To answer the question, How do people monitor different aspects of memory?, a common approach is either (a) to manipulate a variable (i.e., a potential cue for the judgments) and estimate its effects on a particular judgment or (b) to evaluate whether people's judgments are related to some nonmanipulated variable that naturally varies in the environment. If a particular cue is significantly related to a metamemory judgment (whether it is manipulated experimentally or not), then the next step may be to explain why this effect occurs. For instance, consider a case with a nonmanipulated variable—past test performance (e.g., Finn & Metcalfe, 2007; King, Zechmeister, & Shaughnessy, 1980). Finn and Metcalfe (2007) had college students study a list of paired associates (e.g., bucket–salad), make a JOL for each one, and then take a test (e.g., bucket–?) across all pairs. After this initial study-judgment-test trial, the same paired associates were presented again for study and judgments. Most critically, the JOLs made on the second trial were highly related to recall on the first trial: Judgments on the second trial were higher for previously recalled pairs than for ones that were not recalled. Finn and Metcalfe (2007, 2008) referred to this as the memory-for-past-test (MPT) heuristic. One explanation for the MPT effect—or any effect of a variable on metacognitive judgments—is that people have a particular belief (or theory) that the given cue influences memory, and it is this belief that drives the influence of the cue on the judgments (for further discussion, see Koriat & Bjork, 2006; Mueller, Dunlosky, Tauber, & Rhodes, 2014). For the MPT effect, people may believe that past performance predicts future performance (which seems reasonable), and when they make judgments on the second trial, they try to remember whether or not they had recalled each pair and then adjust their judgment accordingly. Alternatively, the influence of MPT may be implicit and unconscious; one possibility is that recalling a response on the first test makes it easier to process that pair on the second study trial, and this subjective experience of easier processing leads to a higher judgment (cf. mnemonic-based influences; Koriat, 1997). These factors—theory or processing experience—are not mutually exclusive; either or both of them may mediate the effects of prior test performance on subsequent judgments. Recent evidence, however, suggests that the MPT effect largely—if not solely—arises from people's beliefs about how prior performance relates to future performance (Serra & Ariel, 2014).
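As a small illustration of how the MPT relation can be checked in data (our sketch, with hypothetical records; a real analysis would use a full item set and inferential statistics), one can simply compare mean second-trial JOLs for pairs that were versus were not recalled on the first test:

```python
# Hypothetical per-pair records from a two-trial study-JOL-test procedure:
# recalled_t1 marks first-test recall; jol_t2 is the second-trial JOL.
records = [
    {"pair": "bucket-salad", "recalled_t1": True,  "jol_t2": 85},
    {"pair": "dog-spoon",    "recalled_t1": False, "jol_t2": 30},
    {"pair": "salt-pepper",  "recalled_t1": True,  "jol_t2": 90},
    {"pair": "lamp-river",   "recalled_t1": False, "jol_t2": 25},
]

def mean_jol(records, recalled):
    jols = [r["jol_t2"] for r in records if r["recalled_t1"] == recalled]
    return sum(jols) / len(jols)

# An MPT pattern appears as higher mean JOLs for previously recalled pairs.
print("Recalled on trial 1:    ", mean_jol(records, True))   # 87.5
print("Not recalled on trial 1:", mean_jol(records, False))  # 27.5
```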


Our main point here is simply that when a variable influences a metamemory judgment, several factors—such as one's theory or one's experience (Koriat & Bjork, 2006)—may be responsible, and further research would be needed to estimate their contribution to those effects. A difficulty arises, however, when a cue does not influence a judgment. Such null effects are common in the literature on metamemory judgments, but consider an example from a paper on JOLs by Susser, Mulligan, and Besken (2013). During study, they manipulated (a) font size (either a 48-point font size or a smaller 18-point font size) and (b) whether font size was manipulated within a list (a mixed-list condition) or between lists (i.e., a pure-list condition). After studying each item, participants made a JOL. As shown in the left-most bars of Figure 2.3 (labeled "Pure"), font size did not significantly influence JOLs when it was manipulated between participants (pure-list design). A natural interpretation of this outcome would be either that JOLs are simply not sensitive to font size or, even more interesting, that people have no beliefs about the relationship between font size and memory. Besides the standard concern about interpreting null effects, a limitation of these interpretations pertains to the pure-list design itself. Namely, given that participants do not experience both levels of the variable, they obviously could not contrast those levels (large vs. smaller font sizes) when making their judgments, and hence the variable would not be expected to influence them (for a detailed rationale, see Ericsson & Simon, 1980). Thus, people may have beliefs about the memorability of large versus smaller font sizes, yet these beliefs would not become manifest unless participants could contrast the two font sizes. What makes the design of Susser et al. (2013) particularly informative is that they also contrasted a pure-list design with a mixed-list design. And, as expected (see right-most bars of Figure 2.3), participants who experienced both font sizes in the mixed-list design demonstrated the standard font-size effect. A take-home message here is that metamemory judgments appear to be less sensitive to the effects of cues that are manipulated in pure-list designs than in mixed-list designs (e.g., Carroll & Nelson, 1993; Koriat, 1997; Koriat, Bjork, Sheffer, & Bar, 2004; Dunlosky & Matvey, 2001), and they are often insensitive to cues manipulated between subjects (for an exception, see Dunlosky & Matvey, 2001). A direction for future research is to discover exactly why the kind of design (pure or mixed list) moderates the effects of cues on metamemory judgments; doing so would provide an important breakthrough for the field.

Figure 2.3  Mean JOL for words presented in a large (48-point) or smaller (18-point) font size that was manipulated either between lists (bars labeled "Pure") or within each list (bars labeled "Mixed"). Error bars represent the standard error of the mean. Values estimated from Susser, Mulligan, and Besken (2013), Figure 1.

Judgment accuracy: measurement.

First, and perhaps foremost, it is essential that researchers do not conflate two kinds of judgment accuracy—absolute and relative accuracy—because they measure different aspects of judgment accuracy and are statistically independent (if not psychologically). Absolute accuracy pertains to the degree to which the magnitude of a person's judgments matches the magnitude of performance, whereas relative accuracy pertains to the degree to which a person can accurately discriminate between more versus less likely outcomes. Importantly, to measure absolute accuracy, the judgments must be made on the same scale as performance. Performance typically is measured as percentage correct, so the judgments must be made on a percentage-correct scale to warrant estimating absolute accuracy. Too often researchers will have people make judgments on a scale that does not correspond to performance and then force correspondence by transforming the judgments. So, a researcher may have participants make JOLs on a 1–7 scale and then, to compute absolute accuracy, transform the scale values by dividing by 7 and multiplying by 100 (resulting in percentage judgments). Never do this, because the resulting index cannot be unambiguously interpreted: Do participants really mean 14% and 29% when they make ratings of 1 and 2, respectively? If not (and there is no way to know), then it makes no sense to evaluate judgments of 1 and 2 against the performance yardsticks of 14% and 29%, respectively. Thus, to investigate absolute accuracy by using item-by-item judgments, one must allow participants to make judgments on the same scale as performance. Another straightforward approach is to ask participants to judge the total number of items they will answer correctly (or have answered correctly); these kinds of global (or aggregate) judgments can then be matched to the number of correct responses on the criterion test. By contrast, to measure relative accuracy, the judgment scale can be ordinal and does not need to match the performance scale, because the aim here is to estimate the degree to which the judgments accurately discriminate between differing levels of performance.

Absolute and relative accuracy were the focus of early measurement work that was inspired by estimating how well people could forecast the weather. Published in the Monthly Weather Review, Brier's (1950) article included a score that captured the discrepancies between a forecaster's judgments and actual outcomes, which was later decomposed by Murphy (1972a, 1972b) into three components pertaining to (a) the difficulty of the task (which on many tasks reflects participants' knowledge or learning), (b) a calibration index, which measures the distance of an empirical calibration curve from perfect calibration, and (c) resolution, which is a measure of a judge's ability to discriminate between the likelihoods of different outcomes. Like other measures, Murphy's decomposition of the original Brier score has its limitations (for a review, see Keren, 1991, pp. 256–261). Since this early measurement work, multiple techniques have been proposed to measure absolute accuracy1 and relative accuracy. Some measures of each kind of accuracy are presented in Table 2.3. This list is not exhaustive (for others, see Schraw, 2009), and a thorough comparison of each measure goes well beyond the scope of this introductory chapter. Given the recent and growing popularity of investigating metacognition and the accuracy of metamemory judgments, there has been a surge in critiques of these measures and debates over which one is best to use. The bottom line, however, was recently captured by Murayama, Sakaki, Yan, and Smith (2014), who noted that "each of the proposed measures has both strengths and weaknesses" (p. 1). Perhaps someday a single measure will be crowned as winner to reside over all others, but we suspect that this day is far off. Until then, what are we to do?
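For reference, Murphy's three-component partition of the Brier probability score (PS) is commonly written as follows. The notation here is ours, not the chapter's: f_k are the judged probabilities, n_k the number of items assigned f_k, o-bar_k the proportion correct among those items, o-bar the overall proportion correct, and N the total number of items.

```latex
% Murphy's partition of the Brier probability score (PS):
% outcome uncertainty (task difficulty) + calibration - resolution.
\[
\mathrm{PS}
  = \bar{o}\,(1-\bar{o})
  + \frac{1}{N}\sum_{k} n_k\,(f_k - \bar{o}_k)^2
  - \frac{1}{N}\sum_{k} n_k\,(\bar{o}_k - \bar{o})^2
\]
```

Smaller values of the middle (calibration) term indicate better absolute accuracy, whereas larger values of the final (resolution) term indicate better discrimination, which is why resolution is subtracted.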

Table 2.3  Measures of Judgment Accuracy and Relevant Sources

Relative judgment accuracy (aka resolution)
  Gamma: Nelson (1984)
  Gamma*: Benjamin & Diaz (2008)
  d-prime: Benjamin & Diaz (2008)
  da: Masson & Rotello (2009)
  Mixed-effects model analysis: Murayama et al. (2014)

Absolute judgment accuracy
  Bias and absolute bias: Pressley, Levin, & Ghatala (1984)
  Calibration curve: Keren (1991)
  Calibration index: Lichtenstein & Fischhoff (1977)

One recommendation is to estimate accuracy with multiple measures to evaluate whether they converge on the same qualitative outcomes. If no discrepancies occur (e.g., gamma yields one conclusion, and da supports the same one), then one can have a bit more confidence in the outcomes. For instance, in our laboratories, we now estimate relative accuracy using gamma and a signal-detection measure (da, when possible), and at this point, they have always supported the same qualitative conclusions (but see Masson & Rotello, 2009). We typically report gamma (because it has been the most frequently reported measure), but if discrepancies occur, we highly recommend reporting the discrepant measures, as well as discovering the source of the discrepancy. Certainly, further research is needed to identify the strengths and weaknesses of these measures across different contexts.
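As an illustration of this converging-measures recommendation, here is a minimal sketch of the Goodman–Kruskal gamma computation over item pairs (the signal-detection measure da requires fitting a detection model and is omitted; the judgment and recall data below are invented):

```python
def goodman_kruskal_gamma(judgments, outcomes):
    """Gamma = (concordant - discordant) / (concordant + discordant),
    computed over all item pairs; tied pairs are ignored."""
    concordant = discordant = 0
    n = len(judgments)
    for i in range(n):
        for j in range(i + 1, n):
            product = (judgments[i] - judgments[j]) * (outcomes[i] - outcomes[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else float("nan")

# Ten JOLs (0-100) and recall outcomes (1 = recalled) for illustration.
jols   = [20, 80, 60, 35, 90, 10, 55, 70, 40, 65]
recall = [ 0,  1,  1,  0,  1,  0,  0,  1,  0,  1]
print(round(goodman_kruskal_gamma(jols, recall), 2))
```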

Judgment accuracy: interpreting differences.

When a difference between groups arises in judgment accuracy (either relative or absolute), a natural conclusion is that the groups differ in their monitoring ability. This conclusion may be correct in many cases, but at least in some, differences in the estimates of judgment accuracy (a derived measure) will not reflect differences in monitoring ability but instead may arise from other factors. Some of these factors may influence judgment accuracy in ways that are uninteresting and hence should be ruled out empirically before drawing firm conclusions. We begin by discussing pitfalls of interpreting bias scores (a common measure of absolute accuracy) and then turn to interpretations of relative accuracy.

Bias scores are computed at the level of individual participants and are simply the difference between mean judgments and mean performance. From a descriptive point of view, positive bias scores indicate overconfidence, bias scores of zero indicate perfect accuracy, and negative bias scores indicate underconfidence. However, the issue that we are exploring here is interpreting bias at a construct level that pertains to monitoring ability per se: Does a difference between two groups really indicate a difference in monitoring ability? Should a nonzero bias score necessarily be interpreted as indicating inaccurate monitoring? We touch on answers to each of these questions in turn.

Whether two (or more) groups differ in judgment bias is the focus of a great deal of metamemory research. The issue is, after all, intriguing. Do people who know less also show more overconfidence (bias) than people who know more? Do people with disorders show more overconfidence than others without the disorder? Does bias decrease as children develop into adolescents, and does it increase again with aging in adulthood? These are just a few of the questions that inspire researchers to explore group differences in bias, and the initial answer to these questions typically has been that group differences in bias reflect differences in monitoring ability. For instance, with respect to cognitive aging, researchers initially thought that older adults in their 60s and 70s were more overconfident and hence poorer at monitoring than were their younger counterparts, who showed little bias in their judgments. But is this conclusion about poorer monitoring necessary? One alternative explanation for such differences in bias is that they occur because of differences in criterion performance and not because of differences in monitoring ability. This explanation is best understood by first recognizing that bias is a derived score, composed of both judgments and criterion performance. The judgment in part taps monitoring, whereas criterion performance taps task ability, and depending on their relation, one can find differences in bias that do not reflect monitoring at all.

Consider this hypothetical example. Older (60s) and younger (20s) adults study paired associates (e.g., dog—spoon) and make a JOL (0–100) after each one. After studying and making judgments, they then receive a test in which they attempt to recall the correct response (spoon) when shown each cue (dog). On average, bias is 40 for older adults and only 5 for younger adults, indicating that the older adults are more overconfident and presumably have poorer monitoring ability. However, now consider the different relationships between judgments and recall that are shown in Table 2.4. For the first hypothetical scenario in Table 2.4, both groups obtained the same level of recall performance, whereas mean judgments are higher for older than younger adults. In this case, the greater overconfidence of older adults may be interpreted as indicating poorer metamemory ability, because underlying memory does not differ yet the judgments do. By contrast, in the second scenario in Table 2.4, where no differences occur in judgments but only in recall, the data are less conclusive. Certainly, older adults appear to have poorer metamemory ability (as indicated by larger bias), but another possibility is that both groups have equally poor metamemory ability. That is, perhaps neither group really has a good sense of how well they will do, so the group who just happens to perform more closely to their judgments looks as if they are better judges.


Table 2.4  Mean Judgments and Mean Recall for Hypothetical Experiments on Cognitive Aging

Scenario                               Mean Judgment   Mean Recall   Bias
Equivalence in criterion performance
  Older                                80              40            40
  Younger                              45              40             5
Equivalence in judgments
  Older                                60              20            40
  Younger                              60              55             5
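Because bias is a simple derived score, the arithmetic behind Table 2.4 can be verified in a few lines (the judgment and recall values are copied from the table):

```python
scenarios = {
    # scenario label: (mean judgment, mean recall), from Table 2.4
    "Equal recall, older":     (80, 40),
    "Equal recall, younger":   (45, 40),
    "Equal judgments, older":  (60, 20),
    "Equal judgments, younger": (60, 55),
}
for label, (judgment, recall) in scenarios.items():
    # Bias is the signed difference between mean judgment and mean recall.
    print(f"{label}: bias = {judgment - recall:+d}")
```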

Fortunately, there is at least one way to evaluate these possibilities. Namely, one can redo the experiment to boost overall recall performance (e.g., by including extra study trials prior to making judgments and administering the criterion test), so that criterion recall is near 90% for younger adults and around 55% for older adults. The rationale here is simply that if age-related deficits do occur in metamemory ability, older adults should still show overconfidence (e.g., make judgments around 70% or higher), whereas younger adults should be more accurate and make judgments around 90%. Outcomes from such research were reported by Connor, Dunlosky, and Hertzog (1997, Experiments 1 and 2). Experiment 1 involved only one study trial prior to the test and demonstrated the standard finding that is consistent with the second scenario in Table 2.4: that is, as compared to younger adults (who demonstrated near-zero bias), older adults were overconfident and showed greater bias in their global predictions (i.e., a single JOL pertaining to how many items would be recalled from the entire list). In Experiment 2, all participants studied the items an extra time, and overall recall performance was boosted for both age groups. Two outcomes were noteworthy: (a) the magnitude of judgments did not change much from Experiment 1 to Experiment 2, suggesting that both groups did not have a good sense of how well they would perform; and (b) as compared to younger adults, older adults showed less bias in Experiment 2! So, in this case, the original discovery of greater bias by older adults did not occur because of age-related deficits in monitoring ability but because of age-related differences in memory.

This example is not the only one from the literature in which group differences in bias that were originally thought to reflect actual differences in awareness (or judgment ability) were at least partly an artifact of differences in performance. For instance, bias is often larger for low performers than for high performers, which has been dubbed the unskilled-and-unaware effect (Kruger & Dunning, 1999). However, Burson, Larrick, and Klayman (2006) demonstrated that the unskilled-and-unaware effect can be moderated by task difficulty. They concluded that "this pattern is consistent with a combination of noisy estimates and overall bias, with no need to invoke differences in metacognitive abilities" (Burson et al., 2006, p. 71). Of course, low performers may actually have difficulties monitoring their performance on some tasks (e.g., Ehrlinger, Johnson, Banner, Dunning, & Kruger, 2008), but the take-home message from the research by Burson et al. (2006) is clear: If differences in bias arise between groups that correspond to differences in criterion performance, then one should use caution in attributing those group differences in bias to differences in monitoring ability—further systematic investigation will be needed to evaluate other reasonable explanations for those differences.

Other situations may occur in which one should use caution in interpreting bias (or poor absolute accuracy), and these are somewhat more technical and may even undermine one's confidence in much of the entire enterprise of investigating absolute accuracy. We consider two situations here, but for further details on the difficulties of interpreting absolute accuracy, see Higham et al. (this volume). First, when participants in metamemory studies indicate that there is a 60% chance that they will perform correctly on the criterion task (e.g., make a JOL of 60 for recalling a response on an upcoming test), the assumption is that they mean "there is a 60% chance." But what if they don't mean that? That is, instead of thinking that "the frequency of correctly responding on items like this one is 60 out of 100," they may actually mean "I'm really not that sure if I'll recall it or not." If so, then a response of 60 instead indicates uncertainty and not the likelihood of recalling the correct response. Consistent with this example, it is evident that people do not always use the percentage scale to indicate the frequency of performing in the long run but instead use middle-of-the-scale values to indicate general uncertainty (e.g., Dunlosky, Serra, Matvey, & Rawson, 2005). Second, even if people have a completely accurate internal state that reflects the true likelihood of an outcome, noise in the judgmental process can artifactually produce bias (Wallsten & González-Vallejo, 1994). To understand this idea more generally, consider someone who has an internal state that indicates 100% success, yet that state is also paired with internal noise that is normally distributed around it; that is, it extends above and below the internal state of "100." Given that percentage scales are truncated at 100, any noise that is added to the judgment process will drive the judgment response downward and hence could result in underconfidence. And, at the other end of the scale, this same kind of psychological noise would result in overconfidence. Minimally, one might never expect people to be perfectly calibrated—or to show no bias—when performance is at either extreme of the scale. More generally, however, this possibility implies that even more caution should be used when interpreting differences in bias scores when groups differ dramatically in performance and hence their judgments may be differentially influenced by psychological noise. It is critical to emphasize the term caution, because differences in bias, even under nonideal circumstances (e.g., when groups differ in criterion performance), may reflect actual differences in monitoring ability. But further investigation would be needed if researchers want to establish confidently that actual differences in monitoring ability exist.

It turns out that interpreting group differences in estimates of relative accuracy can also be difficult, because differences here can also be produced by multiple factors that do not necessarily reflect differences in monitoring ability. These pitfalls can be understood by referring to Figure 2.2 and the following scenario. Imagine two groups studying paired associates and making a JOL after each one, with both groups taking a criterion test of paired-associate recall; one group is told to study the pairs in any way they see fit, and the other group is told to use a new strategy for studying. The researchers predicted that the new strategy would improve relative accuracy, and as expected, it was significantly higher (as measured by a within-participant gamma correlation) for the strategy group (M = .65) than for the standard-instruction group (M = .30). This difference may indicate that the new strategy improved people's ability to monitor their learning, but consider these alternative interpretations. First, the new strategy itself may produce different cues that are more diagnostic of criterion performance. Thus, although the strategy would be useful for boosting judgment accuracy, the strategy did so by yielding better cues and not by improving people's ability to evaluate learning. Second, the new strategy may alter criterion performance (e.g., making it more reliable or producing more within-participant variability), thus allowing the judgments to correlate more highly with it. In this case, the two groups are using the available cues in the same manner, but the strategy group is more accurate because the strategy impacted criterion performance. Versions of both of these alternatives apparently explain why delaying people's JOLs leads to better judgment accuracy (as compared to immediate JOLs; for details on delayed JOLs and the delayed-JOL effect, see Rhodes, this volume). That is, delaying JOLs after study produces different and more predictive cues (Nelson, Narens, & Dunlosky, 2004) and also affects later criterion performance (Spellman & Bjork, 1992; for a review of evidence relevant to both factors, see Rhodes & Tauber, 2011). The former may be construed as boosting people's ability to make better judgments, whereas the latter effect has nothing to do with making judgments at all. Even in the former case, one would not want to conclude that people have a better ability to monitor their memory per se, but instead that the technique produced better cues on which to base judgments.

Another factor that can undermine interpretation of relative accuracy is correct guessing on the criterion test. For instance, if in the previous example the new strategy decreased the likelihood that participants would guess the correct answer, then their judgment accuracy would increase even if their judgments were not influenced. To understand why, consider two people who accurately judged that they had not learned the pair "dog–spoon." If one person correctly guesses (whereas the other does not), the correct guess would inadvertently lower judgment accuracy—that is, the person literally did not learn the pair yet answered correctly by chance. The contribution of correct guessing to relative accuracy has been investigated, and the results are consistent with the idea that correct guessing can limit it. In particular, Schwartz and Metcalfe (1994) examined the relative accuracy of FOK judgments, and they plotted FOK accuracy as a function of the number of alternatives on the criterion recognition tests. As expected, as the number of alternatives increased (and hence correct guessing was less likely), FOK accuracy also increased. Thiede and Dunlosky (1994, Experiment 4) investigated JOL accuracy and manipulated the number of alternatives on the criterion recognition test, and they also found higher JOL accuracy for the test with more alternatives. Moreover, they had participants state when they were guessing on the test, and when "guesses" for each participant were removed from the analysis, JOL accuracy substantially increased for both tests.

In summary, group differences in judgment accuracy—for either absolute or relative accuracy—can occur for many reasons, some of which are relatively uninteresting. Fortunately, if researchers are aware of these factors, their contribution (at least for some factors) can be empirically estimated, so that they can be ruled out as explanations for otherwise interesting effects. Thus, an important area for future research will be to discover the methods and relevant outcomes that are needed to firmly establish that any group differences in relative or absolute accuracy are due to differences in metamemory monitoring and not to co-occurring differences in memory performance.
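To make the noise-and-truncation argument above concrete, consider the following simulation sketch, in which every simulated judge's internal state is perfectly accurate and only the clipping of noisy responses to the 0–100 scale produces nonzero bias (the noise level is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
noise_sd = 15
for true_p in (5, 50, 95):  # true likelihood (%) of recall
    # Judgments equal the accurate internal state plus symmetric noise,
    # then are clipped to the response scale.
    judgments = np.clip(true_p + rng.normal(0, noise_sd, 10_000), 0, 100)
    print(f"true = {true_p:3d}%, mean judgment = {judgments.mean():5.1f}%, "
          f"bias = {judgments.mean() - true_p:+5.1f}")
```

Near the bottom of the scale the clipping pushes mean judgments upward (apparent overconfidence), and near the top it pushes them downward (apparent underconfidence), even though the internal states were exactly right.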

Questions About Control Processes

Estimating judgment accuracy involves relating judgments to criterion performance, and although many approaches have been used to estimate their relationship, the approaches are relatively standard and similar across all phases of learning. By contrast, many approaches and methods have been used to measure control processes, and these often change as one moves from control processes at encoding to those used at retrieval. One reason for the proliferation of methods is simply that many aspects of learning and retrieval can be controlled. For instance, during encoding, one can control (a) how long one studies a particular item, (b) whether one chooses to study an item, (c) the specific learning strategy (e.g., using imagery or repetition) chosen to study any given item, (d) the study schedule for a particular item (e.g., spacing vs. massing an item), and (e) whether to restudy items or not. During retrieval, one can control (a) how long to persist in trying to retrieve a particular sought-after response, (b) whether to continue retrieving after one response has been retrieved, (c) whether to output a response, and (d) the strategies used to choose among alternatives during a multiple-choice test. This list does not exhaust the control processes that can be used, nor does Table 2.1 include all the questions about control. Some questions that should also be mentioned are, Do people optimally control their learning and retrieval? How can the effectiveness of people's control processes be improved? And how much does improving judgment accuracy lead to subsequent improvements in control processes? We cannot cover all the pitfalls that can occur when measuring the different kinds of control processes or even provide a thorough overview of the problems that can arise when investigators attempt to answer these questions. Thus, we decided to touch on two questions that appear to be central to the relationship between memory monitoring and control.

Is monitoring used to control memory?

This question is an important one for metamemory research, partly because it links two of the key components of metamemory (monitoring and control) but also because of its meta-theoretical implications. Namely, a major criticism of metamemory research is that monitoring itself is epiphenomenal. From this stance, the thoughts (or introspections) that we have about our ongoing memory are a byproduct of the memory system, much like a fever is a byproduct of the flu (for detailed rebuttals, see Lieberman, 1979; Nelson, 1996). Even if the subjective experience of monitoring is a byproduct, proponents of a metamemory approach would gain ground if monitoring itself were used to consciously control subsequent memory processes. An approach to evaluating the hypothesis that monitoring affects control (Nelson & Leonesio, 1988) often involves measuring both metacognitive judgments and subsequent control processes. A significant relationship between them would provide some evidence consistent with this hypothesis, and studies have established these relationships between judgments and control processes at the different phases of learning (for reviews, see Dunlosky & Ariel, 2011; Singer & Tiede, 2008). For instance, JOLs made at encoding are typically negatively related to subsequent study time, a finding that is consistent with the hypothesis that people use their monitoring to spend more time studying items that they believe they have learned less well. At test, RC judgments are often positively related to the time taken to recall answers (that end up as omission errors), which is consistent with the hypothesis that people use their retrieval monitoring to control their retrieval.

As in all psychological studies, these correlational outcomes are important in that they support the causal hypothesis that monitoring is used to control memory. Correlations do not necessitate causal conclusions, however, and in these cases, other variables may be responsible for the relationships. One variable in particular—item difficulty—provides a viable alternative that may be responsible for the judgment-control relationship in a manner implying that conscious monitoring per se is not a contributor. Consider the negative relationship between JOLs and self-paced study time: Perhaps item difficulty unconsciously drives self-paced study, such as if the cognitive system requires more time to process difficult-to-learn items than easier ones. Given that item difficulty is inversely related to JOLs, JOLs would then be related to study time even though JOLs (or monitoring of learning) were not consciously used to control learning. That is, study time is still a controlled process, but the control itself is not metacognitively driven. Given such third-variable explanations of judgment-control relationships, an important step for future research will be to conduct experiments that more firmly establish the link between judgments and control processes (for one example, see Metcalfe & Finn, 2008).
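The third-variable worry can be made concrete with a small simulation (all functional forms and parameter values here are invented for illustration): JOLs and study time are each driven by item difficulty, with no causal path between them, yet the two measures correlate negatively.

```python
import numpy as np

rng = np.random.default_rng(2)
n_items = 200
difficulty = rng.normal(0, 1, n_items)

# Difficulty lowers JOLs and lengthens study time; JOLs never enter
# the study-time equation, so any JOL-time correlation is spurious.
jol        = 60 - 15 * difficulty + rng.normal(0, 10, n_items)
study_time = 5 + 2 * difficulty + rng.normal(0, 1.5, n_items)

r = np.corrcoef(jol, study_time)[0, 1]
print(f"JOL-study time correlation: r = {r:.2f}")  # reliably negative
```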

Does improving monitoring accuracy also improve the effectiveness of control?

Preliminary evidence suggests that improving monitoring accuracy does improve the control of both learning (e.g., Dunlosky & Rawson, 2012; Thiede, Anderson, & Therriault, 2003) and retrieval (Koriat & Goldsmith, 1996). Nevertheless, many issues remain unresolved, such as the conditions under which accuracy will matter the most and how much accuracy is enough to matter, to name only two. Systematically answering this question—using converging operations, a wide variety of materials and learners, et cetera—is an important challenge for future research in the field, so we want to emphasize one pitfall that can yield misleading conclusions. To understand this pitfall, consider the issue in the context of monitoring of learning during the encoding phase. The idea here is not that improving monitoring accuracy will magically boost criterion performance; instead, the underlying assumption is that improving monitoring accuracy can boost criterion performance if and only if people are given a chance to use their monitoring to effectively control subsequent study. This assumption is essential yet subtle, and an example may help to clarify. Imagine a student who studies a list of paired associates and then makes JOLs for each one, and imagine that the student's JOLs demonstrate perfect relative accuracy. Now compare that student to another one whose JOLs show chance relative accuracy. After they make JOLs, the former student cannot benefit from the accurate JOLs if she is not given a chance to use them in a manner that could actually boost her memory performance for those items that she accurately judged as not well learned. So, if the students were not given a chance to restudy items (or if restudy time was so limited that little else could be learned), then the former student could not be expected to capitalize on the accurate JOLs and boost her criterion performance. Thus, when exploring the degree to which judgment accuracy is related to effective control, it is essential to give participants a chance to use their judgments in a manner that could boost their learning.

Concluding Comments

In this chapter, we touched on some of the problems and pitfalls that do occur when investigators seek to answer empirical and theoretical questions about human metamemory. The pitfalls that we highlighted are relatively common in the field, yet they are not exhaustive of all those that do occur and do not even touch on the challenges of investigating nonhuman metamemory (for details, see Washburn, Beran, & Smith, this volume). For these reasons, we encourage young investigators to read widely in their areas of interest to uncover the subtleties of collecting, analyzing, and interpreting metamemory data. We recommend beginning with target articles in this handbook, because most authors covered standard methods used in their areas and pointed out some common pitfalls for interpreting data as well. Most of all, we hope that this chapter—in conjunction with the original chapter by Schwartz and Metcalfe (1994)— will help investigators sidestep some pitfalls of conducting metamemory research and will foster the development of interpretable data and solid conclusions that are the fuel for progressive research programs.

Note

1. Absolute accuracy is a general term that pertains to bias (the difference between mean judgment and mean performance) and calibration (which is often a weighted average of differences between each bin of judgments and performance as reflected in a calibration curve).

References

Benjamin, A. S., & Diaz, M. (2008). Measurement of relative metamnemonic accuracy. In J. Dunlosky & R. A. Bjork (Eds.), Handbook of memory and metamemory (pp. 73–94). New York, NY: Psychology Press.
Bjork, R. A., Dunlosky, J., & Kornell, N. (2013). Self-regulated learning: Beliefs, techniques, and illusions. Annual Review of Psychology, 64, 417–444.
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78, 1–3.
Burson, K. A., Larrick, R. P., & Klayman, J. (2006). Skilled or unskilled, but still unaware of it: How perceptions of difficulty drive miscalibration in relative comparisons. Journal of Personality and Social Psychology, 90, 60–77. doi: 10.1037/0022-3514.90.1.60

Dunlosky, Mueller, and Thiede

35

Carroll, M., & Nelson, T. O. (1993). Effect of overlearning on the feeling of knowing is more detectable in within-subject than in between-subject designs. The American Journal of Psychology, 106, 227–235.
Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132, 354–380.
Connor, L., Dunlosky, J., & Hertzog, C. (1997). Age-related differences in absolute but not relative metamemory accuracy. Psychology and Aging, 12, 50–71.
Dunlosky, J., & Ariel, R. (2011). The influence of agenda-based and habitual processes on item selection during study. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37, 899–912. doi: 10.1037/a0023064
Dunlosky, J., & Matvey, G. (2001). Empirical analysis of the intrinsic-extrinsic distinction of judgments of learning (JOLs): Effects of relatedness and serial position on JOLs. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27, 1180–1191.
Dunlosky, J., & Rawson, K. A. (2012). Overconfidence produces underachievement: Inaccurate self-evaluations undermine students' learning and retention. Learning and Instruction, 22, 271–280.
Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., & Willingham, D. T. (2013). Improving students' learning with effective learning techniques: Promising directions from cognitive and educational psychology. Psychological Science in the Public Interest, 14, 4–58.
Dunlosky, J., Serra, M., & Baker, J. M. C. (2007). Metamemory applied. In F. Durso et al. (Eds.), Handbook of applied cognition (2nd ed., pp. 137–159). New York, NY: John Wiley & Sons.
Dunlosky, J., Serra, M. J., Matvey, G., & Rawson, K. A. (2005). Second-order judgments about judgments of learning. The Journal of General Psychology, 132, 335–346. doi: 10.3200/GENP.132.4.335-346
Ehrlinger, J., Johnson, K., Banner, M., Dunning, D., & Kruger, J. (2008). Why the unskilled are unaware: Further explorations of (absent) self-insight among the incompetent. Organizational Behavior and Human Decision Processes, 105, 98–121.
Ericsson, K. A., & Simon, H. A. (1980). Verbal reports as data. Psychological Review, 87, 215–251. doi: 10.1037/0033-295X.87.3.215
Finn, B., & Metcalfe, J. (2007). The role of memory for past test in the underconfidence with practice effect. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33, 238–244.
Finn, B., & Metcalfe, J. (2008). Judgments of learning are influenced by memory for past test. Journal of Memory and Language, 58, 19–34. doi: 10.1016/j.jml.2007.03.006
Hart, J. T. (1965). Memory and the feeling-of-knowing experience. Journal of Educational Psychology, 56, 208–216.
Johnson, M. K., & Raye, C. L. (1981). Reality monitoring. Psychological Review, 88, 67–85.
Keren, G. (1991). Calibration and probability judgments: Conceptual and methodological issues. Acta Psychologica, 77, 217–273.
King, J. F., Zechmeister, E. B., & Shaughnessy, J. J. (1980). Judgments of knowing: The influence of retrieval practice. American Journal of Psychology, 93, 329–343.
Koriat, A. (1993). How do we know that we know? The accessibility model of the feeling of knowing. Psychological Review, 100, 609–639.

36

Koriat, A. (1997). Monitoring one's own knowledge during study: A cue-utilization approach to judgments of learning. Journal of Experimental Psychology: General, 126, 349–370.
Koriat, A., & Bjork, R. A. (2006). Illusions of competence during study can be remedied by manipulations that enhance learners' sensitivity to retrieval conditions at test. Memory & Cognition, 34, 959–972. doi: 10.3758/BF03193244
Koriat, A., Bjork, R. A., Sheffer, L., & Bar, S. K. (2004). Predicting one's own forgetting: The role of experience-based and theory-based processes. Journal of Experimental Psychology: General, 133, 643–656. doi: 10.1037/0096-3445.133.4.643
Koriat, A., & Goldsmith, M. (1996). Monitoring and control processes in the strategic regulation of memory accuracy. Psychological Review, 103, 490–517.
Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it: How difficulties in recognizing one's own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77, 1121–1134.
Lichtenstein, S., & Fischhoff, B. (1977). Do those who know more also know more about how much they know? The calibration of probability judgments. Organizational Behavior and Human Performance, 20, 159–193.
Lieberman, D. A. (1979). Behaviorism and the mind: A (limited) call for a return to introspection. American Psychologist, 34, 319–333.
Masson, M. E. J., & Rotello, C. M. (2009). Sources of bias in the Goodman–Kruskal gamma coefficient measure of association: Implications for studies of metacognitive processes. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35, 509–527.
Metcalfe, J., & Finn, B. (2008). Evidence that judgments of learning are causally related to study choice. Psychonomic Bulletin & Review, 15, 174–179.
Metcalfe, J., & Kornell, N. (2005). A region of proximal learning model of study time allocation. Journal of Memory and Language, 52, 463–477.
Mueller, M. L., Dunlosky, J., Tauber, S. K., & Rhodes, M. G. (2014). The font-size effect on judgments of learning (JOLs): Does it exemplify the effect of fluency on JOLs or reflect people's beliefs about memory? Journal of Memory and Language, 70, 1–12.
Murayama, K., Sakaki, M., Yan, V. X., & Smith, G. M. (2014). Type I error inflation in the traditional by-participant analysis to metamemory accuracy: A generalized mixed-effects model perspective. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40, 1287–1306. doi: 10.1037/a0036914
Murphy, A. H. (1972a). Scalar and vector partitions of the probability score: Part 1. Two-state situation. Journal of Applied Meteorology, 11, 273–282.
Murphy, A. H. (1972b). Scalar and vector partitions of the probability score: Part 2. N-state situations. Journal of Applied Meteorology, 11, 1183–1192.
Nelson, T. O. (1984). A comparison of current measures of the accuracy of feeling-of-knowing predictions. Psychological Bulletin, 95, 109–133.
Nelson, T. O., & Leonesio, R. J. (1988). Allocation of self-paced study time and the "labor-in-vain effect." Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 676–686.
Nelson, T. O. (1996). Consciousness and metacognition. American Psychologist, 51, 102–116.

Methodology for Investigating Human Metamemory

Nelson, T. O., & Narens, L. (1990). Metamemory: A theoretical framework and new findings. In G. H. Bower (Ed.), The psychology of learning and motivation: Advances in research and theory (pp. 125–173). San Diego, CA: Academic Press.
Nelson, T. O., Narens, L., & Dunlosky, J. (2004). A revised methodology for research on metamemory: Pre-judgment recall and monitoring (PRAM). Psychological Methods, 9, 53–69.
Pressley, M., Levin, J. R., & Ghatala, E. S. (1984). Memory strategy monitoring in adults and children. Journal of Verbal Learning and Verbal Behavior, 23, 270–288.
Rawson, K. A., & Dunlosky, J. (2014). Bang for the buck: Supporting durable and efficient student learning through successive relearning. In M. A. McDaniel, R. F. Frey, S. M. Fitzpatrick, & H. L. Roediger (Eds.), Integrating cognitive science with innovative teaching in STEM disciplines. St. Louis, MO: Washington University Libraries.
Rhodes, M. G., & Castel, A. D. (2008). Memory predictions are influenced by perceptual information: Evidence for metacognitive illusions. Journal of Experimental Psychology: General, 137, 615–625. doi: 10.1037/a0013684
Rhodes, M. G., & Tauber, S. K. (2011). The influence of delaying judgments of learning on metacognitive accuracy: A meta-analytic review. Psychological Bulletin, 137, 131–148.
Richardson, J. T. E. (1998). The availability and effectiveness of reported mediators in associative learning: A historical review and an experimental investigation. Psychonomic Bulletin & Review, 5, 597–614.
Schraw, G. (2009). Measuring metacognitive judgments. In D. J. Hacker, J. Dunlosky, & A. Graesser (Eds.), Handbook of metacognition in education (pp. 415–429). New York, NY: Routledge.
Schwartz, B. L., Benjamin, A. S., & Bjork, R. A. (1997). The inferential and experiential bases of metamemory. Current Directions in Psychological Science, 6, 132–137. doi: 10.1111/1467-8721.ep10772899
Schwartz, B. L., & Metcalfe, J. (1994). Methodological problems and pitfalls in the study of human metacognition. In J. Metcalfe & A. P. Shimamura (Eds.), Metacognition: Knowing about knowing (pp. 93–113). Cambridge, MA: MIT Press.
Serra, M. J., & Ariel, R. (2014). People use the memory for past-test heuristic as an explicit cue for judgments of learning. Memory & Cognition, 42, 1260–1272.
Serra, M. J., & Metcalfe, J. (2009). Effective implementation of metacognition. In D. Hacker, J. Dunlosky, & A. Graesser (Eds.), Handbook of metacognition in education (pp. 278–298). New York, NY: Routledge.
Singer, M., & Tiede, H. L. (2008). Feeling of knowing and duration of unsuccessful memory search. Memory & Cognition, 36, 588–597.
Son, L. K., & Metcalfe, J. (2000). Metacognitive and control strategies in study-time allocation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 204–221. doi: 10.1037/0278-7393.26.1.204
Spellman, B. A., & Bjork, R. A. (1992). When predictions create reality: Judgments of learning may alter what they are intended to assess. Psychological Science, 3, 315–316.
Susser, J. A., Mulligan, N. W., & Besken, M. (2013). The effects of list composition and perceptual fluency on judgments of learning. Memory & Cognition, 41, 1000–1011. doi: 10.3758/s13421-013-0323-8
Taraban, R., Maki, W. S., & Rynearson, K. (1999). Measuring study time distributions: Implications for designing computer-based courses. Behavior Research Methods, Instruments, & Computers, 31, 263–269.
Thiede, K. W., Anderson, M. C. M., & Therriault, D. (2003). Accuracy of metacognition monitoring affects learning of texts. Journal of Educational Psychology, 95, 66–73. doi: 10.1037/0022-0663.95.1.66
Thiede, K. W., & Dunlosky, J. (1994). Delaying students' metacognitive monitoring improves their accuracy at predicting their recognition performance. Journal of Educational Psychology, 86, 290–302.
Wallsten, T. S., & González-Vallejo, C. (1994). Statement verification: A stochastic model of judgment and response. Psychological Review, 101, 490–504. doi: 10.1037/0033-295X.101.3.490


CHAPTER 3

Internal Mapping and Its Impact on Measures of Absolute and Relative Metacognitive Accuracy

Philip A. Higham, Katarzyna Zawadzka, and Maciej Hanczakowski

Abstract

Research in decision making and metacognition has long investigated the calibration of subjective probabilities. To assess calibration, mean ratings on a percentage scale (e.g., subjective likelihood of recalling an item) are typically compared directly to performance percentages (e.g., actual likelihood of recall). Means that are similar versus discrepant are believed to indicate good versus poor calibration, respectively. This chapter argues that this process is incomplete: it examines only the mapping between the overt scale values and objective performance (mapping 2), while ignoring the process by which the overt scale values are first assigned to different levels of subjective evidence (mapping 1). The chapter demonstrates how ignoring mapping 1 can lead to conclusions about calibration that are misleading. It proposes a signal detection framework that not only provides a powerful method for analyzing calibration data, but also offers a variety of measures of relative metacognitive accuracy (resolution).

Key Words: calibration, metacognitive accuracy, signal detection theory, resolution, AUC, gamma

Like all the sciences, psychology relies on accurate measurement to advance the discipline. However, in psychology, such accuracy involves more than merely reducing the random error that is likely to be present in any measurement tool, and it has been known since psychology's early beginnings that experimental participants' strategies need to be controlled if psychological measurement is to be free of bias (for an older review on the topic, see Poulton, 1979). Because participants in psychology experiments are active agents, forming opinions and inferences about what is being measured and attempting to strategically alter the outcome, psychology is unique among the sciences; strategic control does not occur with tissue samples in biology, elements in chemistry, or atomic particles in physics, except perhaps in science fiction!

It is our premise in this chapter that whereas most if not all metacognitive theorists are aware of these measurement issues, many do not attempt to control them in their own research. We contend that the failure to do so can have profound effects on the conclusions drawn from research on metacognition, potentially leading to psychological theories being developed unnecessarily to account for what amounts to measurement bias. In what follows, we demonstrate how these biases can obscure the conclusions researchers draw from their results. In the first part of the chapter we deal with research on absolute metacognitive accuracy. We first outline what absolute metacognitive accuracy is and how research on it is typically conducted, provide evidence for the potential for measurement bias in such research, and flesh out the implications for metacognition. Then, using our research on the underconfidence-with-practice (UWP; e.g., Koriat, 1997) effect as an example, we consider how metacognitive tasks that differ only superficially (e.g., binary decisions versus 100-point scales) can affect calibration outcomes. In the final part of the chapter, we outline biases that can affect measures of metacognitive resolution and suggest ways of minimizing them.

Research on Absolute Metacognitive Accuracy

In a typical experiment examining absolute metacognitive accuracy, experimental participants are asked to make a series of metacognitive judgments across multiple trials. Absolute metacognitive accuracy pertains to whether self-rated performance probabilities (usually on a percentage or proportion scale) correspond to actual performance probabilities. Performance ratings can be either prospective (i.e., how good performance will be in the future) or retrospective (i.e., how good performance was in the past). For example, participants may be asked to study cue-target pairs and then judge prospectively (at study) how likely it is that they would be able to remember a target word when prompted with a cue word in a test 10 minutes from now, a rating referred to as a judgment-of-learning (JOL; e.g., Meeter & Nelson, 2003). Conversely, participants may be asked to judge retrospectively how likely it is that an item just recalled on the test is actually a target, a rating referred to as retrospective confidence (e.g., Luna, Higham, & Martin-Luengo, 2011). In either case, the assigned percentage-scale values are then directly compared to the actual probability of recall to assess calibration. For example, if items assigned JOLs of 60% actually have a mean recall probability of 60%, then participants are said to be well calibrated or realistic. On the other hand, if the recall likelihood for that subset of items was 40% or 80%, then participants would be overconfident (OC) or underconfident (UC), respectively.

All of this seems intuitive enough, but let us examine the assumptions underlying this process in more detail. Figure 3.1 shows two mappings that we believe are necessary for accurately assessing absolute metacognitive accuracy. The first mapping is in the hands of participants; they must translate some subjective feeling on an internal dimension (e.g., subjective evidence for later recall, in the case of JOLs) into a number or category on the scale that has been provided for them by the experimenter. The second mapping is more in the hands of the experimenter. It involves comparing participants' ratings (scale values) to their actual accuracy. Virtually all energy expended in modern metacognitive research on calibration has focused on the second mapping, with virtually no consideration of the first. This oversight is problematic because, as we will show, the scale values that are assigned as a result of mapping 1 are determined not only by the sheer magnitude of the internal dimension but also by other, response-level factors. That being so, items with the same magnitude on the internal dimension can lead researchers to conclude that participants are OC, UC, or realistic, depending on how the first mapping is performed. Alternatively, a second problem that can arise is that very different mappings of the internal dimension onto the scale values can result in the erroneous conclusion that calibration is the same.

To illustrate the second problem, consider the three panels of Figure 3.2, all of which represent a perfectly calibrated participant if typical assessment procedures are adopted. That is, items assigned values of 50% are 50% likely to be recalled, items assigned 60% are 60% likely to be recalled, and so on. In panel A, the levels of the internal dimension that are mapped onto the scale values are spaced at equal intervals that correspond to the scale values. On the other hand, in panel B, a completely different first mapping also produces perfect calibration, only in this case, the values on the internal dimension are more moderate and have a narrower range than in the case illustrated in panel A. Finally, in panel C, the participant assigns scale values below 50% to very low levels on the internal dimension, whereas scale values of 50% and above are reserved for very high levels on the internal dimension. Critically, the three cases constitute completely different psychological scenarios and yet, because the focus is on mapping 2 to the detriment of mapping 1, typical calibration measures would be unable to distinguish among them. We presume that panel A represents the scenario that is implicitly intended when metacognitive theorists refer to good calibration. Indeed, as we shall explain, the panel A scenario is the only one that can be deemed to represent good calibration, but it is one that is unlikely to exist in reality because of phenomena such as the spacing bias.


Figure 3.1  The two mappings involved in assessing calibration. Mapping 2 is the focus of modern metacognitive research. Mapping 1, which is under participants’ control, is virtually ignored despite the fact that factors other than the magnitude of the internal dimension can affect it. These factors can ultimately distort whether participants are assessed to be realistic, OC, or UC.


Figure 3.2  Three potential combinations of mapping 1 (which varies) and mapping 2 (which is constant). In panel A, equally spaced levels of the internal dimension are assigned equally spaced scale values that are also evenly spaced across the whole dimension. In panel B, all internal dimension levels are moderate and have a narrower range. In panel C, internal dimension values are either extremely low and assigned scale values below 50% or extremely high and assigned scale values of 50% and above. The three cases would be indistinguishable in typical calibration research.

It is worth noting at the outset that the problems we are identifying exist regardless of the particular absolute accuracy measure being computed; we know of no common measure currently in use that incorporates both mappings. In most studies, a single mean is computed across all values on the scale, and that value is compared to objective performance (e.g., Castel, McCabe, & Roediger, 2007; Scheck & Nelson, 2005). Somewhat more sophisticated single measures of absolute accuracy can also be computed, such as Brier's score (the squared difference between accuracy and the assigned scale value, summed across items; Brier, 1950). Also, as an alternative to single measures of absolute metacognitive accuracy, several means might be computed for different subsets of items, such as those assigned 0–9% versus 10–19% versus 20–29%, and so on. These means can then be plotted against the corresponding mean accuracy for each subset of items to create a calibration curve (see Figure 3.3 for an example). Alternative measures exist that we won't exhaustively list here. The important point for present purposes is that none of these measures accounts for variability in mapping 1, so the problems outlined apply to all of them.
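For concreteness, here is a minimal sketch of both computations just named, a Brier-style score and a binned calibration curve, for JOLs on a 0-100 scale (our own illustrative implementation, with invented data):

```python
import numpy as np

def brier_score(jols, recalled):
    """Mean squared difference between judged probability and outcome."""
    p = np.asarray(jols) / 100.0
    o = np.asarray(recalled, dtype=float)
    return np.mean((p - o) ** 2)

def calibration_curve(jols, recalled, bin_width=10):
    """Mean accuracy for items grouped by JOL bin (0-9, 10-19, ...)."""
    jols = np.asarray(jols)
    recalled = np.asarray(recalled, dtype=float)
    for lo in range(0, 100, bin_width):
        in_bin = (jols >= lo) & (jols < lo + bin_width)
        if in_bin.any():
            print(f"JOL {lo:3d}-{lo + bin_width - 1}: "
                  f"accuracy = {100 * recalled[in_bin].mean():.0f}% "
                  f"(n = {in_bin.sum()})")

jols     = [10, 25, 30, 45, 50, 60, 70, 75, 85, 95]
recalled = [ 0,  1,  0,  0,  1,  1,  0,  1,  1,  1]
print("Brier score:", round(brier_score(jols, recalled), 3))
calibration_curve(jols, recalled)
```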

Historically, not all research on calibration has ignored mapping 1.  For example, Ferrell and colleagues (e.g., Ferrell, 1994, 1995; Ferrell & McGoey, 1978, 1980; Ferrell & Rehm, 1980; Smith & Ferrel, 1983) proposed the decision-variable partition model that incorporated mapping 1 into calibration by quantifying it using methodology adopted from signal detection theory. Using this model, Ferrell and McGoey (1980) were able to simulate aspects of calibration performance such as the hard-easy effect (e.g., Lichtenstein & Fischhoff, 1977). Others in the judgment and decision-making literature have also quantitatively incorporated mapping 1 into their models (e.g., Gu & Wallsten, 2001; Jang, Wallsten, & Huber, 2012; Wallsten & Gonzalez-Vallejo, 1994). In all these cases, mapping 1 is incorporated into the calibration model by treating different levels of the metacognitive scale (10, 20, 30%, and so on) as confidence criteria or cutoffs. Treating scale values as response criteria in this way is a basic premise of signal detection theory and is essentially unquestioned in some areas of cognitive psychology such as recognition memory. To understand the concept of confidence criteria as applied to calibration, consider the top panel of Figure 3.4 labeled “control.” (Ignore the bottom panel for the moment.) The figure depicts a scenario corresponding to an experiment involving JOL ratings. It shows an internal dimension over which later unrecalled (U)  and recalled (R)  items are normally distributed. The internal dimension is a decision axis that is meant to reflect subjective evidence that participants consider in the JOL task. It can assimilate the influence of different JOL cues, evidence from memory, current context, metacognitive beliefs, and so on. All these sources

Higham, Zawadzka, and Hancz akowski

41

100 90 80

Accuracy (%)

70 60 50

52

40

201

30 168

20 10 0

152

101

30

40

163

87 98

72

50 60 JOL (%)

70

130

96

0

10

20

80

90

100

Figure 3.3  Calibration curve adapted from the data of Hanczakowski et al. (2013; Experiment 1, cycle 1). To create the curve, accuracy is plotted as a function of JOL level (in this case, in increments of 10). The diagonal line, presented for reference, shows perfect calibration: for any point of this line, JOLs and accuracy are equal. On the empirical calibration curve, points above the diagonal (JOLs between 0 and 30% in this example), for which JOLs are lower than accuracy, reveal underconfidence, while points below the diagonal (JOLs between 40 and 100%) reveal overconfidence. The values adjacent to each point on the curve represent the number of observations contributing to the mean.

All these sources of evidence reduce to this single decision axis of subjective evidence (the internal dimension) that participants use to make scale-JOLs (and other types of metacognitive judgments, which we shall discuss further). In this model, the rating that is assigned to the item is jointly determined by the level that a given item assumes on the internal dimension and the placement of the confidence criteria. For example, items above the criterion n on the scale but below the criterion n+1 will be assigned the scale value associated with criterion n. To assess calibration using the traditional method, mean ratings are compared to mean accuracy. To do this in the simplest case, mean ratings are computed by weighting each rating level on the scale by the number of observations. For example, very few items in Figure 3.4 are assigned 10% (and they are mostly unrecalled items) so 10% would contribute little to the mean rating, whereas a large number of items are assigned 60% (and the majority of these are recalled) so they would contribute more. Accuracy is computed by simply dividing the total number of recalled items (i.e., the number of items populating the R distribution) by the total number of items (i.e., the number of items populating both the U and R distributions). Crucially, a signal detection model such as this assumes that criteria are variable. In the process that reflects mapping 1, people are able to adjust the placement of criteria that correspond to various scale values. Hence, when calibration is computed in the way just described, the ultimate results of the computation will reflect not only the distribution of items along the internal dimension but also the distribution of criteria. By describing scale values as criteria, the model explicitly incorporates the assumption that the mapping of internal representations onto scale values is a malleable process which constitutes an important component of metacognitive judgments.
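To make the traditional computation concrete, here is a minimal sketch in Python; the ratings and recall outcomes are invented for illustration and are not drawn from any study.

```python
# Minimal sketch of the conventional calibration computation: compare the
# observation-weighted mean rating with overall accuracy. Data are hypothetical.

def calibration_by_means(jols, recalled):
    """jols: per-item scale ratings (0-100); recalled: parallel 0/1 outcomes."""
    mean_jol = sum(jols) / len(jols)          # each scale level is implicitly
                                              # weighted by its frequency
    accuracy = 100.0 * sum(recalled) / len(recalled)
    return mean_jol - accuracy                # > 0 suggests OC; < 0 suggests UC

jols     = [10, 60, 60, 60, 90, 30, 60, 80]  # hypothetical JOLs
recalled = [ 0,  1,  1,  0,  1,  0,  1,  1]  # hypothetical recall outcomes
print(calibration_by_means(jols, recalled))  # 56.25 - 62.5 = -6.25 (slight UC)
```

Note that nothing in this computation consults where the confidence criteria sit on the internal dimension; the distribution of criteria is simply folded into the mean, which is precisely the problem at issue.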

Figure 3.4  Graphical depiction of performance on cycle 3 in the control and experimental groups of Zawadzka and Higham’s (2016) Experiment 1. The top panel shows unrecalled (U) and recalled (R) items distributed over the internal dimension, with recalled items having a higher mean on the internal dimension than unrecalled items. The vertical lines represent confidence criteria or cutoffs. Thus, for a JOL of 100%, an item must have enough evidence on the internal dimension to exceed the 100% criterion. An 80% JOL would be assigned to an item that exceeds the 80% criterion but does not exceed the 90% criterion, and so on. In the experimental group (bottom panel), new, difficult pairs are introduced on cycles 2 and 3. These items have less evidence on the internal dimension and so are situated to the left of the repeated pairs. The introduction of these new pairs changes the item context by increasing the range of evidence over which all items in the experiment are distributed. This change in context causes participants to significantly (*) liberalize their lower confidence criteria (those associated with 10%, 20%, 30%, 40%, and 50%) so that the new items with less evidence are accommodated (i.e., not all assigned 0%). The upper confidence criteria (those associated with JOLs of 60% and above) remain static between the groups. The end result is a context effect on JOLs: mean JOLs assigned to the U and R repeated pairs are higher in the experimental group than the control group.

The internal dimension.

At first blush, the notion that the internal dimension in this model might be an amalgamation of several subdimensions (memory access, metacognitive cues, and so on) may seem flawed. This impression might stem from the common but erroneous assumption, present in both the memory and metacognitive literatures, that the internal dimension in signal detection theory simply reflects a unitary memory process, sometimes referred to as “memory strength” or “familiarity.” In this vein, Koriat (2012) recently aligned signal detection theory with the currently unpopular direct-access view of metacognitive judgments:

Several models assume that confidence in old/new memory recognition tests is scaled directly from the perceived familiarity of the probe […]. Thus, a single continuum (“signal strength”) is postulated, which is defined conjointly by the old/new response and the confidence level attached to it, so that confidence is essentially used as an index of memory strength. (p. 81)

However, although the internal dimension can be and often is treated simplistically in this manner, it need not be. For example, in recognition memory research, Wixted and colleagues (e.g., Wixted, 2007; Wixted & Mickes, 2010; Wixted & Stretch, 2004) have broadened the conception of memory strength to include two qualitatively different memory processes: familiarity and recollection. That is, memory strength is not synonymous with a single memory process (familiarity), as is often assumed. Rather, multiple memory processes feed into the single dimension of memory strength. Furthermore, strength is not a property of the stimulus itself, as a direct-access conceptualization of the internal dimension would suggest, but can vary depending on the retrieval cues available at testing. In Wixted and Mickes’ (2010) own words:

A key theoretical consideration in the CDP [continuous dual-process] model is that “memory strength” is not a unitary construct, which means that it is not properly conceptualized as a fixed property of the encoded memory trace independent of the conditions of retrieval. (p. 1046)

And later they state, “… strength is not solely a property of the memory trace. Instead, strength is a property of the signal that is returned in response to a retrieval cue” (p. 1046). Similarly, in applying signal detection theory to metacognitive judgments, the conception of the internal dimension could be broadened even further to incorporate factors that have nothing to do with direct access from memory. For example, the presence of particular metacognitive cues (e.g., the number of successful recalls on previous cycles of the experiment; to be discussed further) may increase an item’s placement on the internal dimension compared to a situation in which such cues are absent. This increase on the dimension would also elevate the scale value that is assigned (because a higher criterion is surpassed). The incorporation of factors like these into the dimension would make a signal detection model of metacognitive judgments more compatible with the inferential approach to metacognitive judgments that is more popular in contemporary research.

An analogy for how several factors can feed into a single decision axis is how people deal with complex decisions such as whether to buy a particular vehicle. Several considerations (sub-dimensions) such as fuel economy, appearance, size, age of the vehicle, and so on must be taken into account and feed into the decision. Vehicle 1 may be high on factors 1 and 2, but low on factors 3 and 4, for a medium net level on the decision axis. Vehicle 2, on the other hand, which is high on all four factors, would end up higher on the decision axis. However, at the end of the day, the buyer has to decide whether or not to buy a particular vehicle, and this decision process can be modeled using a single-dimension decision axis of internal evidence—in this case, evidence for going ahead with the purchase.

Interestingly, the more widespread view that metacognitive judgments such as JOLs directly represent subjective likelihood must also rely implicitly on an assumption of a single, heterogeneous internal dimension. Requiring participants to make single ratings on a 100% scale implies that there is a single subjective-likelihood dimension internal to participants that can be translated into scale values. If more than one dimension were assumed, presumably researchers would attempt to measure those separate dimensions by requesting more than one rating or by conducting more sophisticated analyses on the single rating that is obtained. Furthermore, no one assumes that subjective probability is a unitary construct; rather, the common assumption underpinning the probabilistic interpretation of metacognitive judgments is that several sub-dimensions feed into a single dimension (subjective probability), and that these sub-dimensions are the same as those we assume feed into the internal signal detection dimension, as discussed previously. The main difference between the signal detection and the subjective-likelihood approaches, then, lies in what the assigned scale values represent. With the signal detection approach, the scale values represent confidence criteria that are malleable, and their placement is determined by mapping 1. Hence, a direct comparison of mean judgments with mean performance to assess calibration is meaningless. In contrast, the subjective-likelihood approach assumes that the scale values directly index subjective probability. Consequently, mean judgments and mean performance can be meaningfully compared to infer psychological OC, UC, or realism. Because mapping 1 plays no role, the scale values are subjective likelihood.
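The two-stage picture can be sketched in code. In the hypothetical example below, the cue weights and criterion placements are illustrative assumptions only; the point is that several sub-dimensions collapse onto one evidence axis, and mapping 1 then converts evidence into a scale value via movable cutoffs.

```python
import numpy as np

def evidence(cues, weights):
    """Collapse several sub-dimensions (cues) onto a single decision axis."""
    return float(np.dot(cues, weights))

def scale_value(ev, criteria):
    """Mapping 1: return the highest scale value whose criterion is exceeded.
    criteria[k] is the (movable) cutoff for the value 10 * (k + 1) percent."""
    jol = 0
    for k, cutoff in enumerate(criteria):
        if ev >= cutoff:
            jol = 10 * (k + 1)
    return jol

weights  = np.array([0.5, 0.3, 0.2])   # e.g., fluency, past recall, beliefs
criteria = np.linspace(-1.0, 1.0, 10)  # cutoffs for 10%, 20%, ..., 100%
item     = np.array([0.9, 0.2, 0.4])   # high on cue 1, lower on cues 2 and 3
ev = evidence(item, weights)           # net level on the axis: 0.59
print(scale_value(ev, criteria))       # 80 with these criteria; shifting the
                                       # criteria changes the JOL, not the item
```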

Scaling Biases

Despite the work described earlier in judgment and decision making that incorporates mapping 1 into the judgment process, mapping 1 is rarely considered in modern research on metacognitive judgments (although see Higham, 2007, 2011, 2013). We believe that this is a fatal flaw, as there is ample evidence that scaling biases involving mapping 1 affect the ratings that participants make in a variety of tasks, which in turn influences whether participants are deemed well calibrated or not. For example, consider the spacing bias discussed by Poulton (1979; see also Parducci, 1963; Parducci & Perrett, 1971). Although Poulton was discussing psychophysical experiments, the principles likely apply to all category judgments, such as those made on a 100-point metacognitive scale. The spacing bias occurs when stimuli are located at intervals that differ in size on the internal dimension yet the intervals are judged to be of equal size. One could argue that panel C in Figure 3.2 is an example of the spacing bias. In that case, there are two stimuli located close together at the bottom end of the internal dimension and three additional stimuli located at the top end. However, the judgments that are made to these stimuli are equally spaced on the rating scale, such that although the internal difference between the second and third lowest stimuli is extremely large, the difference in judgment is the same (25%) as that between stimuli that are internally much closer together (e.g., the two stimuli that are internally the highest are also judged to be 25% different even though they are very similar). In Poulton’s words:

In judging the stimuli, the observer uses their rank order of magnitude rather than their relative subjective sizes. He behaves as if all the stimuli were subjectively equally spaced. […] Thus the smaller intervals are overestimated compared with the larger intervals. (pp. 786–787)

Admittedly, empirical support for the spacing bias has only been documented in psychophysical tasks such as those involving judgments of stimulus magnitude (e.g., white noise: Montgomery, 1975). It has not been documented with metacognitive judgments about the correctness of one’s own (future or previous) performance. However, in our view, the spacing bias, as well as a host of other biases involving mapping 1 documented by Poulton (1979), is likely to be present in most tasks involving category judgments. What is true of all such tasks is that they necessitate the mapping of some internal, subjective dimension (whether that be “stimulus loudness” or “subjective evidence for later recall”) onto a scale divided into categories with numeric labels. There is no obvious reason why mapping 1 for a stimulus-based internal dimension should be any different from that for a metacognitive dimension. Certainly, the potential for such biases exists and cannot be ignored, because the implications for research on calibration are considerable.

To illustrate the profound effect that the spacing bias can have on calibration results, consider a hypothetical experiment in which there are two experimental groups with two conditions each (i.e., a 2 × 2 mixed design) and in which a scenario like that depicted in panel C of Figure 3.2 exists (i.e., there is a spacing bias). In the first group, mean ratings are 25% and 50% for conditions 1 and 2, respectively, and the same is true of recall. In the second group, mean ratings are 50% and 75% for conditions 1 and 2, respectively, and again, the same is true of recall. Using typical calibration methodology, we would conclude that there were effects of group and condition (on both ratings and recall) but that participants were well calibrated (because mean ratings match recall probabilities in all cells of the design). However, this conclusion completely overlooks the fact that the subjective difference between conditions 1 and 2 in group 1 is about 10 times larger than the analogous difference in group 2, despite the fact that the difference in recall between the two conditions in each group is identical. Stated this way, it is clear that participants were not well calibrated at all in this experiment: comparable effects of condition on recall between the groups were accompanied by vastly different effects of condition on subjective experience. Thus, the spacing bias led to conclusions about calibration that were not just slightly inaccurate but completely wrong. However, had mapping 1 been taken into account when assessing calibration by, for example, modeling how scale criteria were placed in this experiment, this error could have been avoided.
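The hypothetical experiment is easy to state numerically. In the sketch below, only the ratings and recall values follow the text; the internal (subjective) positions are invented purely to illustrate the tenfold difference.

```python
# (internal position, mean rating %, recall %) for each condition; the internal
# positions are hypothetical values consistent with panel C of Figure 3.2.
group1 = [(0.10, 25, 25), (1.10, 50, 50)]
group2 = [(2.00, 50, 50), (2.10, 75, 75)]

for name, grp in (("group 1", group1), ("group 2", group2)):
    for i, (pos, rating, recall) in enumerate(grp, start=1):
        print(f"{name}, condition {i}: rating - recall = {rating - recall}")

# Ratings equal recall in every cell, so the means-based analysis declares
# everyone well calibrated. Yet the subjective gap between conditions is
# 1.10 - 0.10 = 1.00 in group 1 but only 2.10 - 2.00 = 0.10 in group 2:
# a tenfold difference the means-based analysis cannot see.
```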

Underconfidence with Practice

If mapping 1 is ignored and a scaling bias exists with some but not all metacognitive scales, or the nature of the scaling bias differs between metacognitive scales, then there is a danger that results obtained with one measure will fail to generalize to others. Put differently, if mapping 1 differs between two different measures, then divergent conclusions regarding calibration from those two scales are not only possible but, in fact, very likely. In this vein, Hanczakowski, Zawadzka, Pasek, and Higham (2013) recently documented differences between percentage-scale JOLs and binary JOLs (as well as other binary tasks) in a series of experiments on the UWP effect. Previous research has shown that repeated study-test cycles of the same cue-target pairs improve cued-recall performance. However, whereas scale-JOLs also increase across cycles, they do not increase as much as recall. The result is realism or slight OC on the first cycle, but UC on the second and subsequent cycles (e.g., Finn & Metcalfe, 2007, 2008; Koriat, 1997; Koriat, Sheffer, & Ma’ayan, 2002). Hanczakowski et al. (2013) replicated this typical UWP pattern in their first experiment using scale-JOLs. However, in their second experiment, instead of participants being required to use a 0–100% scale to make JOLs, they simply made yes/no judgments as to whether they would later recall the items—a task we refer to as the binary JOL task. With a binary task, realistic participants should respond “yes” to approximately the same proportion of items that they later recall, and discrepancies between these values reflect either OC or UC. Thus, by comparing the proportion of “yes” responses to the proportion of successful recalls, calibration can be assessed. Presumably, if participants were truly UC (i.e., the value associated with the items on the internal probability dimension was too low compared to recall probability), then they would demonstrate that UC on a variety of different measures, including binary ones. However, this was not the case; for the binary task, there was some OC on the first cycle, but no evidence whatsoever for UC on the second or third cycles.
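Calibration for the binary task reduces to a comparison of two proportions, as in this minimal sketch with hypothetical data.

```python
def binary_calibration(said_yes, recalled):
    """said_yes, recalled: parallel 0/1 lists, one entry per item."""
    p_yes = sum(said_yes) / len(said_yes)
    p_recall = sum(recalled) / len(recalled)
    return p_yes - p_recall          # > 0: OC; < 0: UC; near 0: realism

said_yes = [1, 1, 0, 1, 0, 1, 1, 0]  # hypothetical yes/no predictions
recalled = [1, 0, 0, 1, 1, 1, 1, 0]  # hypothetical recall outcomes
print(binary_calibration(said_yes, recalled))  # 0.625 - 0.625 = 0.0
```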


In their later experiments, Hanczakowski et al. (2013) found similar dissociations between scale judgments and a binary betting task in which participants were asked to bet on whether they would recall each item later. If they bet that they would recall an item, they gained points if recall was successful but lost points if it was not (see also McGillivray & Castel, 2011). No points were gained or lost if participants chose to pass. As with the binary JOL task, binary betting produced excellent calibration on later cycles, but scales showed the usual UWP pattern. In their Experiment 4, this outcome was obtained even for participants who made both judgments to each item (i.e., a scale-JOL followed by a binary betting decision). In this particular experiment, because the scale-JOL always preceded the betting decision, whatever information participants incorporated into their scale-JOLs was also available for the betting decision. Overall, the results of Hanczakowski et al. (2013) make it clear that very different results are obtained when JOLs are elicited with a 100-point scale versus a yes/no decision, yet there is no principled reason derivable from the probabilistic interpretation of calibration that should lead to this outcome. There is nothing privileged about 100-point JOL scales; they have simply become the norm in metacognitive research, and there is no a priori reason to assume that binary tasks (yes/no JOLs or yes/no betting) should yield different results. It seems reasonable to assume that one source of the difference between scale and binary calibration lies with mapping 1. That is, when queried about future recall, people may differ in the way that they map levels on an internal dimension to the multiple criteria that correspond to scale values (scale-JOLs) versus the single criterion that warrants a “yes” response (binary JOL). These differences in mapping 1 underlie differences in calibration evident with different measurement methods. The divergence of results obtained with various measurements does not prove, however, the vital role of mapping 1 in understanding metacognitive judgments made on a percentage scale. After all, it could be that likelihood scales are so well defined in our social milieu that they allow for a single way of mapping internal confidence onto scale values. What is thus needed is a clear demonstration that scale values are malleable, as assumed by the signal detection model of metacognition. Next we develop the case for the malleability of values assigned on a likelihood scale. We first introduce the recalibration hypothesis, which we believe provides a reasonable account for the scale/binary dissociations, and then we describe our recent findings supporting this hypothesis.

The recalibration hypothesis.

One of the basic premises underpinning the recalibration hypothesis is that scale-JOLs in metacognitive experiments are specific to the experimental context and cannot be taken outside of that context to represent some form of absolute subjective probability. Instead, the hypothesis assumes that scale-JOLs are essentially ordinal confidence ratings that participants use to rank order items given a specific context. In other words, scale-JOLs are dependent on where criteria are placed in the mapping-1 process. If that context changes, then the assignment of ratings to items may also change despite the fact that the underlying psychological evidence on the internal dimension has not changed. Stated differently, mapping 1 in Figures 3.1 and 3.2 is affected by context. Clearly, if these assumptions are correct, then directly comparing mean scale-JOLs with mean memory performance to establish whether participants are OC, UC, or realistic is largely a nonsensical exercise. It is worth noting that the recalibration hypothesis is conceptually similar to Frederick and Mochon’s (2012) scale-distortion theory of anchoring judgments.

The multicycle paradigm is an ideal circumstance for recalibration effects to occur. The same items gain evidence on the internal dimension on each successive cycle, with some being recalled on all three cycles. The consequence is that as the procedure progresses, there are multiple items that are very high on the internal dimension, which constitutes a context that is completely different from the first cycle. According to the recalibration hypothesis, on later cycles, participants are in a bit of a quandary: there are multiple items that participants are highly confident about recalling, but at the same time, there are demands to discriminate between the items in some meaningful way. Consequently, whereas in the context of difficult items (e.g., cycle 1), an item with a large amount of evidence on the internal dimension may be assigned 90%, in the context of multiple easy items (e.g., cycle 3), that rating may be recalibrated and lowered to, say, 80%. The recalibration hypothesis assumes that this reduction occurs because it is likely that in the context of multiple items with high amounts of evidence, there will be some items with even more evidence than the item currently being judged. The extra evidence for these extreme items may need to be accommodated by lowering the JOL assigned to the current item so that its ranking is lower. In reality, of course, both the current item being judged and the other items with more extreme evidence are likely to be recalled. Thus, both the extreme items and the item being judged require the same, equally high ratings (near 100%) for calibration to be realistic. The fact that this requirement is not met contributes to the UWP effect: whereas recall may approach 100% for all these items, mean scale-JOLs are somewhat lower because of participants’ attempts to rank order items with large amounts of evidence. Critically, however, if instead of being required to make scale-JOLs participants are simply asked whether they will later recall an item (binary JOL) or whether they are willing to bet on recalling an item (binary betting), the same pressure to rank order the items is not present. Instead, all items that exceed the criterial amount of evidence will receive “yes” responses; there are no degrees of “yes” available. The result is that both the extreme items and the current item with somewhat less evidence receive “yes” responses. Later, those items are indeed recalled, which results in good calibration.

Recently, Zawadzka and Higham (2016, Experiment 1) garnered evidence for the item-context effects that underpin the recalibration hypothesis. They employed a procedure consisting of three study-test cycles, as in UWP research. On cycle 1, 60 unrelated cue-target pairs were presented to all participants for study and scale-JOL ratings. Twenty of these 60 items were designated as critical items. On cycles 2 and 3, the control group was presented with the same 60 pairs again, just as in the typical UWP design. In contrast, only the 20 critical items were repeated on cycles 2 and 3 in the experimental group; the other 40 items consisted of 20 new unrelated word pairs and 20 new nonword-word pairs. We hypothesized that the inclusion of the new, difficult items in cycles 2 and 3 of the experimental group would cause participants to recalibrate the scale. In particular, lower JOL values (e.g., 10, 20, 30%, and so on) would be reserved for rating the new, difficult items, leaving only the higher JOL scale values to be assigned to the critical items. Such recalibration would increase their JOL mean relative to the control group that rated only repeated items. Importantly, we assumed that recalibrating the scale in this manner would have no effect on the locations of the critical items on the internal dimension. In other words, there would be no difference in the level of genuine, psychological UC between the conditions.

Relatedly, we also assumed that there would be no differences in critical-item recall between the control and experimental groups. We assumed instead that the difference in mean JOLs between the control and experimental groups would be entirely due to a change in mapping 1. The results confirmed our assumptions. On cycle 1, there was no difference in mean JOLs or mean recall between the control and experimental groups, which was expected because the experimental manipulation (the addition of new, difficult filler items) did not occur until the later cycles. On cycle 2, there was again no difference in recall (56% control; 58% experimental), but as expected, the experimental group had a higher JOL mean (46%) than the control group (38%). Similarly, there was no difference in recall between the groups on cycle 3 either (74% for both groups), but again, the JOL mean was higher for the experimental group (68%) than the control group (60%). Thus, the inclusion of new, difficult items on cycles 2 and 3 increased the JOLs that participants assigned to the repeated critical items, but it had no effect on recall, just as the recalibration hypothesis predicted.

To examine the results in more detail, we conducted a receiver operating characteristic (ROC) analysis, which is shown in Figure 3.5. To construct an ROC, several hit rates (HRs) and false alarm rates (FARs) are plotted against each other by treating the scale values as response criteria, one criterion for each 10% increment on the scale. For a metacognitive task, the HR and FAR are defined as the proportion of correct and incorrect responses, respectively, at or above a given scale value (criterion). For example, there may be 20 correct (recalled) items that were assigned 80% or higher out of a total of 50 correct responses, for HR = 0.40. Similarly, there may be 10 incorrect (unrecalled) items assigned 80% or higher out of a total of 40 incorrect responses, for FAR = 0.25. The point (0.25, 0.40) could then be plotted on the ROC to correspond to the 80% criterion. It is possible to use ROC analysis to gain insight into where the criteria are located on the internal dimension. The ROCs in Figure 3.5 suggest that for cycle 3, the lower confidence criteria—those associated with 10, 20, 30, 40, and 50%—were all more liberal (further to the top-right of the figure, where HRs and FARs are larger) in the experimental condition compared to the control condition. In contrast, the upper values on the JOL scale (those located toward the bottom-left of the figure) remained static between the groups.
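For readers who want to reproduce this kind of analysis, the sketch below builds metacognitive ROC points from per-item ratings and outcomes; the data are contrived so that the 80% criterion yields exactly the worked example above.

```python
def roc_points(jols, recalled, criteria=range(10, 101, 10)):
    """Return one (FAR, HR) pair per criterion, treating scale values as cutoffs."""
    R = [j for j, r in zip(jols, recalled) if r]       # JOLs of recalled items
    U = [j for j, r in zip(jols, recalled) if not r]   # JOLs of unrecalled items
    return [(sum(j >= c for j in U) / len(U),          # FAR at this criterion
             sum(j >= c for j in R) / len(R))          # HR at this criterion
            for c in criteria]

jols     = [80] * 20 + [50] * 30 + [80] * 10 + [40] * 30  # contrived ratings
recalled = [1] * 50 + [0] * 40                            # 50 recalled, 40 not
pts = dict(zip(range(10, 101, 10), roc_points(jols, recalled)))
print(pts[80])   # (0.25, 0.4): 10/40 unrecalled and 20/50 recalled at 80%+
```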


Figure 3.5  ROC curves for cycle 3 of the control and experimental groups of Zawadzka and Higham’s (2016) UWP experiment. The curves lie virtually on top of each other, but the lower confidence criteria, those associated with low JOL values (10–50%), are more liberal (further to the top-right of the figure) in the experimental group than in the control group. The higher JOL values (i.e., 60–100%) do not statistically differ. HR = hit rate (the proportion of recalled items given the criterial JOL or higher); FAR = false alarm rate (the proportion of unrecalled items given the criterial JOL or higher); ROC = receiver operating characteristic.

The ROC results suggest that participants altered their lower confidence criteria in response to the inclusion of the new items; that is, the scale was recalibrated by altering mapping 1 to accommodate the new context created by the inclusion of difficult items on later cycles, increasing the JOL mean of the critical, repeated items. However, the position of these critical items on the internal dimension was not affected. A graphical depiction of this situation is shown in Figure 3.4.
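The mechanism lends itself to a small simulation. The sketch below is our own illustration rather than the published analysis: the evidence for the repeated items is held fixed while only the lower criteria are liberalized, and the mean JOL rises, as in Figure 3.4.

```python
import numpy as np

rng = np.random.default_rng(1)
evidence = rng.normal(1.5, 1.0, 10_000)      # repeated items; evidence is fixed

def mean_jol(evidence, criteria):
    """criteria[k] is the cutoff for the scale value 10 * (k + 1) percent."""
    return (10 * np.searchsorted(criteria, evidence, side="right")).mean()

control      = np.linspace(-0.5, 3.1, 10)    # evenly spread criteria
experimental = control.copy()
experimental[:5] -= 1.0                      # liberalize only the 10%-50% cutoffs

print(mean_jol(evidence, control))           # lower mean JOL
print(mean_jol(evidence, experimental))      # higher mean JOL, same items
```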

Cue utilization.

The recalibration hypothesis suggests that the difference Hanczakowski et al. (2013) observed between scale-JOLs and binary decisions was attributable—at least in part—to context-driven recalibration of the JOL scale. That is, for the scale-JOLs, mapping 1 (shown in Figures 3.1 and 3.2) changed over cycles because the context (the average difficulty level of the items) changed. By contrast, because there is less need to rank order the items, recalibration does not occur with binary judgments in the same way as with scale-JOLs, producing the scale/binary dissociation.


In the present section, we again review our research employing the multicycle procedure (Zawadzka & Higham, 2015) and introduce another factor that may contribute to the dissociation between scale and binary judgments observed by Hanczakowski et al. (2013): the fact that scale-JOLs may be sensitive to certain metacognitive cues that binary judgments are not. In contrast to the previous considerations on the recalibration hypothesis, the research about to be discussed might be interpreted to suggest that qualitatively different information feeds into scale and binary judgments. In other words, there may be differences in the nature of the internal dimensions themselves, not just differences in the mapping of overt responses onto internal dimension values. Henceforth, we refer to this as the different-dimensions hypothesis. If the different-dimensions hypothesis is valid, it involves a fundamentally different type of mechanism from the scale recalibration described earlier, one that has the potential to produce genuine psychological UC.

An assumption behind the different-dimensions account is that the demands associated with scale-JOLs are different from, and potentially greater than, those for binary judgments. In particular, participants attempt to rank order the items using scale-JOLs, even items that they believe they will ultimately later recall. For binary tasks, however, fine-grained item discriminations are unnecessary because participants need only respond “yes” or “no.” It is true that there is ranking in the sense that items assigned “yes” are ranked higher than items assigned “no,” but discriminating between items assigned “yes,” for example, is not essential. Thus, for scale-JOLs, but not binary decisions, participants may be searching for cues to accomplish the rank ordering.

To test this possibility, Zawadzka and Higham (2015) conducted two three-cycle experiments, one using scale-JOLs and the other using the binary-betting task described earlier. During testing on the third cycle of each group, participants were asked how many times they had recalled each item on the previous two cycles. The responses available were zero (not recalled on either previous cycle), once (recalled on one previous cycle but not the other), or twice (recalled on both previous cycles). Previous research by Finn and Metcalfe (2007, 2008; see also Ariel & Dunlosky, 2011; England & Serra, 2012) has shown that memory for past test performance is an important and valid cue for scale-JOLs in the multicycle paradigm: previously recalled items are both more likely to be recalled on the current cycle and to be assigned substantially higher JOLs. However, is the same true of items previously recalled once versus twice? Potentially, participants may still use multiple versus single recall as a cue in making their JOLs despite the fact that it had no effect on recall performance. Also, if participants do make use of this cue when making scale-JOLs, do they also make use of it if the task is binary? As already noted, the search for cues may not be as intensive with binary tasks because the need to rank order items is less, so there is the possibility that the number of previous successful recalls acts as a metacognitive cue for scale tasks, but not binary ones.

The first aspect of Zawadzka and Higham’s (2015) results worth noting is that the dissociation observed by Hanczakowski et al. (2013) was replicated. That is, the UWP effect was observed with scale-JOLs on cycles 2 and 3, but was absent with the binary betting task. More critical is the once versus twice recall distinction. Mean recall in the scale-JOL group on cycle 3 for items judged as previously recalled once versus twice was similar and very high: 94% and 96%, respectively. However, despite the fact that the once versus twice cue had very little effect on recall, participants still used it in making their scale-JOLs. In particular, mean scale-JOLs for once versus twice previously recalled items were 71% and 84%, which differed significantly. Recall in the binary-betting group was also very high if items were judged to have been recalled previously: once = 96%; twice = 98%. However, unlike in the scale-JOL group, there was no difference in the proportion of bets, and betting proportions were very well calibrated: items judged to have been recalled once had a betting proportion of 96%, whereas items judged to have been recalled twice had a betting proportion of 99%. Thus, whereas participants were sensitive to the once versus twice cue when making scale-JOLs, they were not sensitive to it when making binary betting decisions.

As already noted, the different-dimensions hypothesis suggests that participants in the binary and scales groups varied in their sensitivity to the once/twice cue because different information formed the basis of the two judgments. Unlike our research on the UWP effect described earlier, the utilization of different information between the scale and binary tasks seems, on the surface, incompatible with an explanation based on mapping 1. However, we believe that it is critical to distinguish between observed sensitivity to metacognitive cues on the one hand and actual sensitivity on the other. This distinction is depicted in Figure 3.6, which shows distributions of U and R items on cycle 3. Within the R distribution are items that were previously recalled once versus twice, designated in Figure 3.6 as 1s versus 2s, respectively. The data from Zawadzka and Higham’s (2015) experiment indicated that the vast majority of previously recalled items were also recalled on cycle 3, so no such items populate the U distribution in the figure. The first aspect of Figure 3.6 to note is that the twice-recalled items are located further up the dimension than the once-recalled items, and this difference exists for both the binary and scales tasks. The difference in the location of these items on the internal dimension suggests that participants are actually sensitive, at a metacognitive level, to the once/twice cue in both the binary and scales tasks. In other words, Figure 3.6 suggests that there was no between-task difference in the information feeding into the internal dimensions, undermining the different-dimensions hypothesis. What then explains the observed dissociation in Zawadzka and Higham’s (2015) research?

Figure 3.6  Graphical depiction showing how mapping 1 could be the source of the results reported by Zawadzka and Higham (2015). In the top panel representing the scales condition in cycle 3, criteria are spread across most of the internal dimension. Consequently, the upper criteria (70% +) are positioned to discriminate between twice-recalled (2) and once-recalled (1) items, all of which are recalled on cycle 3. Conversely, in the cycle-3 binary condition depicted in the bottom panel, all once- and twice-recalled items fall above the criterion, meaning that they are all assigned “yes” responses. Thus, even though the evidence feeding into the internal dimension is the same for the scales and binary tasks, overt discrimination between once- versus twice-recalled items is undetected in the binary condition because the yes/no (Y/N) criterion is placed too liberally for that particular discrimination to be made. U = unrecalled on cycle 3; R = recalled on cycle 3.

The answer here lies in the placement of the binary Y/N criterion: because participants were attempting to discriminate between R and U items in their betting decisions, rather than between items that were previously recalled once versus twice, the binary criterion is far to the left of the previously recalled items. As Figure 3.6 shows, this criterion location would mean that all previously recalled items would fall above the Y/N criterion, yielding near-ceiling HRs and FARs (i.e., participants responded “yes” to nearly all previously recalled items, which is consistent with the data). With the HRs and FARs equal to each other (i.e., both equal to one), no discrimination between once- versus twice-recalled items would be evident. Thus, even though participants were actually sensitive to the once/twice cue in both the binary and scales tasks, observed sensitivity would have been masked in the binary condition because of the Y/N criterion placement.1 The same was not true in the scales condition; because there are multiple confidence criteria spread across most of the internal dimension, the upper criteria would have HRs and FARs that were not at ceiling, allowing for observable once/twice discrimination. The end result was observed sensitivity to the once/twice distinction in the scales group, but not in the binary group, even though there was no difference in the information utilized between the binary and scales tasks. Future research distinguishing between the different-dimensions hypothesis and an account based on mapping 1 might focus on procedures designed to persuade participants to set a more conservative binary criterion (e.g., the metacognitive payoff matrix might be altered to heavily penalize false alarms). With the criterion set more conservatively, the HRs and FARs associated with the once/twice discrimination would no longer be at ceiling, perhaps revealing sensitivity to the once/twice cue in the binary task that had previously been masked. On the other hand, if participants still demonstrated no sensitivity despite a more conservative binary criterion, then support would be garnered for the different-dimensions hypothesis.
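The masking logic is straightforward to verify numerically. In the sketch below, the evidence means and the two criterion placements are assumed values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
once  = rng.normal(2.0, 0.5, 5000)   # assumed evidence, once-recalled items
twice = rng.normal(2.6, 0.5, 5000)   # assumed higher evidence, twice-recalled

for criterion in (0.5, 2.3):         # liberal versus conservative Y/N placement
    p1, p2 = (once > criterion).mean(), (twice > criterion).mean()
    print(f"criterion {criterion}: yes-rate once = {p1:.2f}, twice = {p2:.2f}")

# With the liberal criterion, both yes-rates are at ceiling (~1.00), so the
# once/twice difference is invisible; with the conservative criterion, the
# same underlying difference produces clearly different yes-rates.
```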

Are Binary Tasks Always Better Calibrated Than Scale Tasks?

The criterion-placement hypothesis suggests that binary judgments may not be as sensitive as scale judgments in some situations. Hence, an important question at this juncture is whether binary judgments are always better calibrated than scale-JOLs across different paradigms and metacognitive measures. If they are, then the advice to metacognitive theorists would be to stop using scale-JOLs to assess calibration and make use of binary tasks instead. After all, it would be preferable to know whether an observed effect is based on an artifact of percent scales before developing psychological theories to account for it, which is what appears to have happened with the UWP effect. In this section, we describe some research designed to address this question with retrospective confidence judgments (RCJs) in a classroom setting (see Tidwell, Chrabaszcz, & Dougherty, 2016; Koriat, 2016, for further discussion of RCJs).

The research about to be described is an extension of Higham’s (2013) work. He investigated the calibration of RCJs in an actual university classroom using introductory psychology materials. Students were administered a multiple-choice test with either four or five alternatives on the first day of class to assess their knowledge of psychology prior to any lectures. Higham was primarily interested in whether students could use the plurality option (Luna et al., 2011; Luna & Martin-Luengo, 2012) to regulate their test accuracy. This required students to create answers containing one alternative (single answer) as well as answers containing three alternatives (plural answer) and then to choose one for grading (part marks were awarded for plural answers containing the correct alternative). However, for present purposes, we will focus only on the calibration results for single answers, regardless of whether they were chosen.

Assessing calibration using the method common in the literature (i.e., comparison of mean ratings and mean memory performance), Higham (2013) found that there was UC. This was an unusual result given the large literature showing that answers consisting of a single alternative on multiple-choice tests tend to demonstrate OC (e.g., Dunlosky & Rawson, 2012; Griffin & Tversky, 1992; Koriat, Lichtenstein, & Fischhoff, 1980; Liberman, 2004; Luna et al., 2011; Sieck, Merkle, & Van Zandt, 2007), a point to which we shall return. In particular, a 2 (number of alternatives per question: 4, 5) × 2 (measure: accuracy, subjective likelihood) mixed analysis of variance (ANOVA) with measure as the only within-subjects variable (an analysis not reported by Higham) yielded a main effect of measure, F(1, 168) = 20.26, p < .001, η2 = .11. Mean scale ratings (40%) were significantly lower than mean accuracy (45%). However, this main effect was qualified by a significant interaction, F(1, 168) = 5.80, p = .02, η2 = .03; UC was larger for the four-alternative test (8%) than the five-alternative test (3%), an effect wholly attributable to the fact that accuracy was higher for the former test (48%) than the latter (43%) whereas rated likelihood remained constant (both tests: 40%).

More recently, Higham, Zawadzka, and Hanczakowski (2016) followed up this result and investigated whether a binary judgment task would eliminate the UC that Higham (2013) observed in his study, just as it eliminated UC in the UWP paradigm. There were a few changes from the original experiment. First, to test the robustness of the UC effect, students were tested on material after they had received lectures on it. In all likelihood, the UC observed by Higham was at least partly attributable to the fact that students knew they were writing a test on material they had yet to learn, causing them to set their confidence criteria conservatively. Second, because general-knowledge (GK) questions are common in metacognitive research, the test contained a mixture of 12 introductory psychology questions and eight GK questions randomly intermixed (20 in total). Third, after judging the correctness likelihood of all the test questions on a percent scale, participants were asked to return to each answer and make a binary Y/N confidence judgment (i.e., to indicate whether they were confident or not that each answer was correct). This aspect of the methodology was similar to Experiment 4 of Hanczakowski et al. (2013), in which a binary Y/N judgment followed a scale rating (although both ratings in the current study pertained to retrospective rather than prospective memory performance). As with Hanczakowski et al., this aspect of the design ensured that all participants had the same memorial and metacognitive information available at the time each judgment was made. Finally, all questions had only two alternatives rather than four or five, again because this question format is more common in the metacognitive literature (e.g., see Koriat, 2012, for a review).

Unlike Higham (2013), the results indicated good calibration for scale-RCJs. Specifically, for GK questions, mean accuracy was 65% and the mean scale-RCJ was 62%, a difference that was not significant. Similarly, for introductory psychology questions, mean accuracy was 67%, which did not differ significantly from the mean scale-RCJ of 66%.


As noted earlier, this result suggests that Higham’s (2013) observation of UC was likely partly due to participants being tested before being exposed to the material. However, very different results were obtained with binary confidence ratings. In particular, the mean proportion of high-confidence judgments was 33% and 42% for GK and introductory psychology questions, respectively, producing UC of –32% for GK questions and –25% for introductory psychology questions! Thus, these results are effectively a mirror image of those found by Hanczakowski et al. (2013, Experiment 4). That is, whereas Hanczakowski et al. found that scale judgments exhibited UC while binary judgments (following scale judgments) were well calibrated, Higham et al. (2016) found that scale judgments were well calibrated whereas binary judgments (again following scale judgments) exhibited very large UC! Of course, there are a number of differences between the studies, not least of which is that different metacognitive judgments were involved (JOLs in Hanczakowski et al.; RCJs in Higham et al.), but the disparate results highlight an important take-home message from our research: we are not arguing that binary judgments are always better calibrated than scale judgments in all circumstances because they are a more accurate index of people’s calibration. Either judgment type might result in more accurate calibration (as determined by a comparison of judgment versus performance means) depending on the experimental circumstances. Rather, we believe that the important lesson from this research is that metacognitive scale judgments alone should not be taken at face value as a direct measure of subjective likelihood. That being the case, metacognitive researchers must abandon the practice of relying solely on mean scale judgments as if they are the gold standard, ignoring other potentially equally valid judgments of absolute metacognitive accuracy (e.g., binary tasks) that yield completely different results for no obvious reason. Nor should binary judgments alone be taken at face value to assess calibration. Rather, we think it is critical to incorporate the influence of mapping 1 into the calibration estimate and to seek converging evidence from multiple judgment types if meaningful statements are to be made about genuine psychological UC (or OC).

Our research highlights another issue that is easily missed: binary judgments are no freer from mapping-1 variability than scale judgments. Just as scale judgments are malleable and context dependent because of mapping 1, the same is true for binary judgments.


The only difference is that there are multiple cutoffs to consider with scales (those associated with different levels of the scale: 10, 20, 30%, and so on), whereas there is only one with binary judgments (the Y/N criterion). Hence, one way to understand the UC results of Higham et al. (2016) with the binary task is that participants set a Y/N confidence criterion that was too conservative for the proportion of high-confidence judgments to match accuracy. To elaborate on this point, suppose we had created an experimental context to encourage more liberal responding in the binary task (e.g., informed participants that the test would be easy; see England & Serra, 2012, for an example), resulting in a match between the high-confidence proportion and accuracy. Could we conclude that participants are now genuinely less UC at a psychological level than they were without this instruction? In our opinion, the answer to this question is “no”! Informing participants that the test is easy is simply a way of altering how participants map Y/N judgments onto the internal dimension (i.e., mapping 1 applied to binary judgments). It does not alter the perceived amount of evidence or the subjective likelihood of the items. Hence, in our view, participants are no more or less UC given a liberal binary task in which the “yes” proportions match performance than if their responding were more conservative and the means did not match. Naturally, the same arguments apply to liberal versus conservative scale judgments with multiple criterion cutoffs.
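A toy simulation makes the point; the distribution parameters and criterion placements below are our own assumptions, not estimates from the classroom data.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
correct  = rng.random(n) < 0.65                # fix accuracy at 65%
evidence = np.where(correct,
                    rng.normal(1.0, 1.0, n),   # evidence for correct answers
                    rng.normal(0.0, 1.0, n))   # evidence for incorrect answers

for criterion in (1.2, 0.22):                  # conservative versus liberal
    p_yes = (evidence > criterion).mean()
    print(f"criterion {criterion}: P(yes) = {p_yes:.2f} vs accuracy = 0.65")

# The conservative criterion yields P(yes) near 0.31 (apparent UC of about
# -34%); the liberal one yields P(yes) near 0.65 (apparently well calibrated).
# The perceived evidence never changed; only mapping 1 did.
```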

Confidence Criteria and Resolution

A second important metacognitive index of performance is known as resolution, or relative metacognitive accuracy. Unlike calibration (or absolute metacognitive accuracy), good resolution is not dependent on matching mean performance and mean scale values assigned in a metacognitive rating task. Instead, this index indicates the degree to which participants are able to discriminate the correctness of their own answers, regardless of the scale that is used. For example, suppose mean recall on a memory test was 50% and the mean JOL rating assigned to recalled versus unrecalled targets was 100% versus 80%, respectively. Although calibration would be poor in this case (OC: mean JOLs > mean recall), resolution would be good. Because higher JOLs were given to the items that were actually recalled than to the ones that were not, the participant was successful at using the JOL scale to discriminate between correct (recalled) and incorrect (unrecalled) responses. Note that for calibration, typically only a percent or proportion scale would be suitable because it must be compatible with the performance scale so that a direct comparison can be made, whereas for resolution, the units of measurement are immaterial (e.g., a 1–6 confidence scale could be used).

By far the most common measure of resolution used in the metacognitive literature is G (the Goodman-Kruskal gamma coefficient; Goodman & Kruskal, 1954). This measure is an ordinal measure of association that has been recommended by Nelson (1984, 1986a, 1986b). However, recently some problems have been identified with G (e.g., Benjamin & Diaz, 2008; Rotello, Masson, & Verde, 2008). For example, in a series of simulations, Masson and Rotello (2009) demonstrated how it is affected by response bias. In particular, as response bias varied from conservative to liberal, G produced a U-shaped function: at extremely conservative and extremely liberal criterion settings (low versus high criterion placements on the internal dimension, respectively), G tended to be high, whereas it was low at more moderate levels of bias. Given that G is meant to be measuring resolution, it should be independent of response bias (FAR), so the fact that it was not suggests that G is a poor measure of resolution. To understand this criticism, consider again the top panel of Figure 3.4. Whereas the vertical lines in that figure represent criteria or cutoffs, the amount of overlap of the U and R distributions (the degree to which they are separated) indicates resolution. If the distributions were completely overlapping, then discriminating R (correct) responses from U (incorrect) responses would be impossible. The further the distributions separate, the better discrimination (and resolution) will be. Critically, the location of the criteria has nothing to do with the amount of overlap; so as long as the overlap remains constant, any index of resolution should also remain constant regardless of the placement of the criteria.

To demonstrate how G is affected by response bias, we conducted simulations analogous to those reported in Masson and Rotello (2009), but extended those simulations by testing bias effects on an index of resolution that they did not consider: the area under the ROC curve computed with the trapezoidal rule, explained in more detail later on. Specifically, we made assumptions regarding the nature of the underlying evidence distributions and criterion placements, randomly sampled observations from those distributions, and recorded where the observations were located on the evidence dimension with respect to the criteria. For one set of simulations, 200,000 observations were randomly sampled from each of two equal-variance Gaussian distributions, whereas for the other, the same number of observations were randomly sampled from rectangular (uniform) distributions. For the Gaussian simulation, the U (noise) distribution had M = 0 and SD = 1 whereas the R (signal) distribution had M = 1.5 and SD = 1 (i.e., distance between means = 1.5). For the rectangular simulation, the U distribution uniformly varied between –1 and 1 whereas the R distribution uniformly varied between –0.5 and 1.5 (i.e., distance between means = 0.5; SD for both distributions = 0.58). In both the Gaussian and rectangular cases, there were 11 criteria that represented ratings corresponding to 0, 10, 20, … 100%. The lowest and highest criteria were deliberately set at extreme locations so that all and no observations fell below and above them, respectively. This was done so that we could accurately compute the area under the ROC curve, as we shall explain. The middle criteria varied depending on the condition. For the no-bias condition, the 50% criterion was placed midway between the R and U distributions and the others were located at increments of 0.40 SD units (i.e., z-scores) above and below that midpoint (with the exception of the 0% and 100% criteria, which were set at extremes as explained previously). For the liberal and conservative conditions, the same process was followed except that the 50% criterion was set further down and further up the internal dimension, respectively. In particular, for the Gaussian simulations, the 50% criterion was placed at .75, –1.5, and 1.5 SD units for the no-bias, liberal, and conservative conditions, respectively. For the rectangular simulations, the 50% criterion was placed at .25, –1.5, and 1.75 SD units for the no-bias, liberal, and conservative conditions, respectively. ROC curves for the three bias conditions for each simulation type are shown in Figure 3.7.

Figure 3.7  ROC curves for both equal-variance (EV) Gaussian and rectangular (uniform) evidence distributions. The top, middle, and bottom panels show ROCs for no bias (criteria spaced evenly across the dimension), liberal bias (criteria clustered toward the bottom end of the dimension), and conservative bias (criteria clustered toward the top end of the dimension). HR = hit rate; FAR = false alarm rate.

The next step was to compute G at each criterion based on a 2 × 2 design. For example, for the no-bias Gaussian simulation, there were 1,919 U observations and 39,674 R observations that fell above the 90% criterion. Because there were 200,000 observations drawn from each distribution in total, that meant that 198,081 U observations and 160,326 R observations fell below the 90% criterion. Using these frequencies, a 2 × 2 contingency table could be generated (a = 39,674, b = 1,919, c = 160,326, d = 198,081) and G computed using the standard formula,


$$G = \frac{N_c - N_d}{N_c + N_d} = \frac{ad - bc}{ad + bc} \qquad (1)$$

where $N_c$ and $N_d$ are the number of concordant and discordant pairs, respectively. This process was repeated for each criterion, and the results for the no-bias Gaussian simulation are shown as the dotted line in the top-left panel of Figure 3.8. It is clear from the figure that we replicated Masson and Rotello’s (2009) results. That is, G was highest at extreme FARs and lowest for moderate values. It is also clear from the other panels in Figure 3.8 that the same was true of G for all the other simulations, regardless of whether the criteria were set with no bias, liberally, or conservatively. The problem is particularly noticeable with rectangular distributions. For example, for the rectangular simulation with no bias (top-right panel in Figure 3.8), G varies from 1.00 at the extremes to less than half that value (0.46) at moderate FAR values! We also computed d' (the distance between the means of the evidence distributions in SD units of the standard normal distribution) at each criterion. To compute d', we used the same 2 × 2 contingency matrix as for G and computed the HR and FAR at each criterion using the following formulae,

$$\mathrm{HR} = \frac{a}{a + c} \qquad (2)$$

$$\mathrm{FAR} = \frac{b}{b + d} \qquad (3)$$

d' could then be computed at each criterion using the following formula,

$$d' = z(\mathrm{HR}) - z(\mathrm{FAR}) \qquad (4)$$

where z(HR) and z(FAR) are the z-scores associated with the HR and FAR, respectively. The results are shown as the solid lines in each graph of Figure 3.8. For the Gaussian simulations, d' remains roughly constant across variations in bias (FAR), regardless of whether bias is manipulated within or between simulations. The exception to this constancy is for extremely liberal (high-FAR) bias, shown in the middle-left panel of Figure 3.8, where d' appears to dip.

Figure 3.8  Goodman-Kruskal gamma correlation (G) and d' as a function of bias (false alarm rate; FAR). G increases as bias becomes liberal or conservative (U-shaped) regardless of the distributional assumptions. Conversely, d' remains relatively stable over varying levels of bias for Gaussian evidence distributions (except at extreme levels of bias where the indices are based on very few observations). However, for rectangular evidence distributions, d' performs as poorly as G. EV = equal variance.

Higham, Zawadzka, and Hancz akowski

55

could be due to error associated with the very small number of observations in the c cell of the 2 × 2 design at extremely liberal criteria. For example, there were only four observations (out of 200,000) in the c cell at the 10% criterion (i.e., where the c cell constitutes observations drawn from the R distribution that fell below the 10% criterion of the 2 × 2 design). However, for the rectangular simulations, d' performed as poorly as G, showing clear variation over changes in the FAR. This variation is unsurprising given that d' is designed to measure discrimination for Gaussian evidence distributions, but it has important implications for metacognitive researchers interested in using an unbiased index of resolution. Masson and Rotello (2009) recommended parametric measures of discrimination based on an assumption of Gaussian distributions because those measures are unaffected by bias as long as the distributional assumptions are met. However, the rectangular simulations show that if the underlying distributions are not Gaussian, such measures may be as problematic as G. As very little is known about the nature of the evidence distributions in metacognitive tasks (i.e., whether they are Gaussian or not), we believe Masson and Rotello's (2009) recommendations may be premature. For example, in our own work on the UWP effect, we have found that on the third cycle, the shape of the ROC (which can be used to determine the nature of the underlying evidence distributions) does not suggest Gaussian distributions. For this reason, we now turn to alternative indices of resolution.
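Before turning to those alternative indices, it may help to see the computations in formulas (1) through (4) laid out concretely. The following Python sketch is a minimal reproduction of the logic of the no-bias Gaussian simulation described earlier; the random seed, variable names, and printed format are our illustrative choices rather than anything from the original simulations.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=1)
n_obs = 200_000
u = rng.normal(0.0, 1.0, n_obs)   # U (noise) distribution: M = 0, SD = 1
r = rng.normal(1.5, 1.0, n_obs)   # R (signal) distribution: M = 1.5, SD = 1

# Eleven criteria for ratings of 0, 10, ..., 100%. The extremes sit far
# outside both distributions; the nine middle criteria are spaced at
# 0.40-SD steps around the 50% criterion at 0.75 (the no-bias condition).
criteria = np.concatenate(([-10.0], 0.75 + 0.4 * np.arange(-4, 5), [10.0]))

for crit in criteria[1:-1]:       # skip the extremes, where HR = FAR = 0 or 1
    a = np.sum(r > crit)          # R observations above the criterion
    b = np.sum(u > crit)          # U observations above the criterion
    c = n_obs - a                 # R observations below the criterion
    d = n_obs - b                 # U observations below the criterion
    g = (a * d - b * c) / (a * d + b * c)   # formula (1)
    hr, far = a / n_obs, b / n_obs          # formulas (2) and (3)
    d_prime = norm.ppf(hr) - norm.ppf(far)  # formula (4)
    print(f"FAR = {far:.3f}  G = {g:.3f}  d' = {d_prime:.2f}")

Output of this kind should reproduce the qualitative pattern in the left column of Figure 3.8: G is inflated toward extreme FARs, whereas d' remains roughly constant for equal-variance Gaussian evidence.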

Area under the ROC curve.

In most cases, theorists do not compute G at each confidence cutoff as we have done, but instead use confidence ratings to compute a single measure of G. Masson and Rotello (2009) demonstrated problems with confidence-based G as well as binary G, and we sought to replicate their findings here. In addition, we explored how another potential index of resolution not considered by Masson and Rotello might fare as bias was varied: the area under the ROC curve, which can be computed straightforwardly using the trapezoidal rule. Computed in this manner, the area under the curve is often referred to as AUC or Ag (e.g., Green & Moses, 1966; Pollack, Norman, & Galanter, 1964). In particular, the formula for AUC is

$$AUC = 0.5 \sum_{k=0}^{n} \left( HR_{k+1} + HR_k \right)\left( FAR_{k+1} - FAR_k \right) \quad (5)$$


where k represents the different criteria plotted on the ROC and n is the number of criteria. Essentially, to compute AUC, the areas underneath the curve between the different points on the ROC are summed by joining the points together with straight lines to create several trapezoids. Added to that sum is the area (triangle) between the most conservative empirical point and (0, 0). This addition ensures the total area under the curve is included. Chance-level versus perfect resolution would yield AUC = 0.5 versus AUC = 1.0, respectively.2 Both AUC and confidence-based G will only approximate their "true" values. AUC will tend to underestimate the true area under the curve depending on the nature of the underlying evidence distributions. For example, if they are Gaussian, the true ROC is curvilinear (see ROCs on the left of Figure 3.7) and joining the ROC points with straight lines will mean that some of the area that should be under the curves will actually fall above it. This problem is minimized with more criteria. In the case of G, ties (pairs that are neither concordant nor discordant) will distort the computed value (Masson & Rotello, 2009) compared to a true value of G that is based on the population without any ties. To eliminate these problems, in addition to computing AUC and G for the six simulations shown in Figures 3.6 and 3.7 based on the 11 criteria, we also computed true values of these indices. For G, we followed Masson and Rotello (2009) and sampled 200,000 pairs of observations, one each from the U and R distributions, and computed V, the proportion of concordant pairs divided by the total number of concordant and discordant pairs. In particular,

$$V = \frac{ad}{ad + bc} \quad (6)$$

The measurement precision involved in the sampling process meant that there were no ties (i.e., there were no pairs consisting of identical values). Without ties, Nelson (1984) demonstrated that

$$G = 2V - 1 \quad (7)$$

allowing us to derive a population-based value of G, which should be unaffected by bias. To derive a true value of AUC, we again sampled 200,000 observations from each of the U and R distributions, but instead of using only 11 criteria to estimate AUC, we used 10,000 (i.e., n = 10,000 in formula 5). Computing AUC with this many criteria would

essentially eliminate any loss of area due to cutting off the curves on the ROC. The results of our simulations are shown in Figure 3.9. Considering first the Gaussian simulations (top panel of Figure 3.9), it is clear that bias affects both indices. For G, the confidence-based computation overestimates true G in all cases, even if there is no bias, just as Masson and Rotello (2009) found. For AUC, 11 criteria produce a value very similar to true AUC if there is no bias. However, there is underestimation of the true value if the criteria are placed liberally or conservatively on the internal dimension. This underestimation occurs because the long straight line between (0, 0) and the most conservative point on the ROC in the liberal case, and between (1, 1) and the most liberal point on the ROC in the conservative case, eliminates some of the true area under the ROC from the computation (see the middle-left and bottom-left ROCs in Figure 3.7). The results of the rectangular simulations are shown in the bottom panel of Figure 3.9. Here, confidence-based G performs somewhat better in

that the true value of G is not overestimated as much as with Gaussian distributions. AUC also performs well in all bias conditions such that AUC computed from 11 criteria corresponding to 0, 10, 20… 100% is very close to the true area based on 10,000 criteria. If there is no bias, the empirical and true values of AUC are almost identical, which is not true of G. In summary, all the indices we have considered (G and d' computed at each criterion as well as confidence-based G and AUC) are affected by bias under some circumstances. d' performs well with equal-variance Gaussian distributions except at very extreme criterion placements, but does not fare well with rectangular distributions. On the other hand, G based on a 2 × 2 design performs poorly regardless of the type of distributions, assuming more extreme values at both liberal and conservative criterion locations. G based on confidence ratings also shows an effect of bias if the distributions are Gaussian, and overestimates true G even if there is no bias. However, the overestimation is somewhat less if the distributions are rectangular. Like confidence-based

G, AUC is also affected if the criteria are clustered at one end or the other of the internal dimension (i.e., all criteria are liberal or conservative) but estimates the true area more closely than G if the criteria are spread evenly across the dimension. Furthermore, it performs well regardless of bias if the underlying distributions are rectangular. In our view, AUC is the best measure of resolution of the ones examined if (1) criteria are not clustered at one end or the other of the internal dimension and (2) the nature of the underlying distributions is not known. Regarding "1," the criterion placements can be estimated from an ROC of the experimental data (see Figure 3.7 for examples of ROCs with no bias, liberal bias, or conservative bias) and adjustments made to the experimental procedure if clustering appears to be occurring. However, Zawadzka and Higham's (2016) results discussed earlier suggest that, rather than clustering their criteria, participants attempt to distribute them evenly over the full range of the dimension by, for example, shifting criteria to accommodate new items that occupy a new place on the internal dimension. Regarding "2," parametric indices such as d' (or da, which is analogous to d' except that it is more suitable for cases in which evidence distributions have unequal variance) are very desirable if the underlying distributions are known to be Gaussian. This is the case in recognition memory research, where countless studies have shown that an unequal-variance Gaussian model simulates performance very well (e.g., Wixted, 2007; Wixted & Mickes, 2010; Wixted & Stretch, 2004). However, presently, very little is known about the evidence distributions underpinning metacognitive tasks (although see Higham, 2007), so the assumption that they are Gaussian may not be warranted in some cases. Certainly, violations of normality exist for the later cycles of the UWP paradigm, making d' a very poor measure of performance (e.g., see the performance of d' with rectangular distributions in Figure 3.8). In addition, there are other more practical reasons to prefer AUC to d' (or da). Note that several points are missing in the d' plots for the rectangular simulations in Figure 3.8. This occurred because of HRs and/or FARs of 1 or 0, for which the z-scores necessary to compute d' (and da) are undefined (see formula 4). Correction factors to avoid this issue have been proposed (e.g., see Macmillan & Creelman, 2005), but these may not be suitable for metacognitive tasks: unlike stimulus-contingent discrimination tasks such as old/new recognition memory, for which there are typically a high and equal number of items populating the noise (lure) and signal-plus-noise (target) distributions, the number of items populating the evidence distributions can vary greatly in metacognitive tasks. For example, in a JOL task, a participant with poor recall may have only 10% of the total number of items populating the R distribution. We have found that when the number of items determining the HR or FAR is low, the correction factors can have a large, undesirable effect on d' and da. In contrast, AUC can still be computed if the HR and/or FAR is 1 or 0, which avoids the problem of what to do with undefined values.

Figure 3.9  AUC (area under the ROC curve), AUCTrue (AUC computed with 10,000 criteria), GConf (G computed from confidence ratings), and GTrue (true G without ties computed directly from sampling pairs from the evidence distributions) as a function of bias. G = Goodman-Kruskal gamma coefficient; EV = equal variance.
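As a concrete illustration of formula (5), and of the practical point just made, the following Python sketch computes the trapezoidal area from a set of criterion-level points, tolerating HRs and FARs of exactly 0 or 1 because no z-transformation is involved. The (FAR, HR) points are invented for illustration, and the final line merely echoes the relation G = 2AUC − 1 discussed in Note 2; none of this is code from the studies cited in this chapter.

import numpy as np

def auc_trapezoid(far, hr):
    # Trapezoidal area under an ROC (formula 5). Points are sorted by FAR
    # and anchored at (0, 0) and (1, 1), so rates of exactly 0 or 1 pose
    # no problem, unlike the z-scores required for d'.
    order = np.argsort(far)
    f = np.concatenate(([0.0], np.asarray(far, dtype=float)[order], [1.0]))
    h = np.concatenate(([0.0], np.asarray(hr, dtype=float)[order], [1.0]))
    return 0.5 * np.sum((h[1:] + h[:-1]) * (f[1:] - f[:-1]))

# Hypothetical criterion-level rates; note the extreme points at 0 and 1,
# which would leave d' undefined.
far = [0.00, 0.02, 0.10, 0.25, 0.50, 0.80, 1.00]
hr = [0.00, 0.15, 0.40, 0.70, 0.90, 0.98, 1.00]

auc = auc_trapezoid(far, hr)
print(f"AUC = {auc:.3f}; implied true G = {2 * auc - 1:.3f}")  # G = 2AUC - 1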

Future Directions

The common theme of the discussion of calibration and resolution presented here is the use of signal detection theory to describe metacognitive processes. The signal detection approach offers both accurate measures of resolution and theoretical insights into metacognitive judgments that can serve as a basis for computing calibration measures. We believe that adopting the signal detection framework in metacognitive research will thus not only allow researchers to measure metacognitive processes more accurately but will also open new avenues for further experimentation. One of the benefits of adopting this framework lies in the fact that the same framework is currently very much in use in recognition memory research (e.g., Wixted & Mickes, 2010). Although recognition research is different from metacognitive research inasmuch as it focuses on a different underlying dimension (memory evidence as opposed to the metacognitive evidence discussed here), it seems reasonable to assume that the decisional processes described by concepts like bias or criterion malleability may share commonalities across the recognition and metacognitive domains. We thus argue that some research in the area of recognition memory may serve as an inspiration for related research in the metacognitive area. One example of an issue examined in recognition memory that could be potentially addressed in metacognitive research is the question of within- and between-list criterion shifts. A substantial body of recognition research has for some time been devoted to the issue of how people decide how much evidence is necessary for calling a recognition probe "old." Studies have by and large shown that this criterion placement is determined at the start of a recognition test based on a predicted difficulty of the test

and remains unchanged if the difficulty is changed in the course of the test (e.g., Verde & Rotello, 2007; although see Bruno, Higham, & Perfect, 2009). Considering metacognitive processes in terms of the signal detection framework suggests similar questions that could be asked for various metacognitive judgments and decisions. Thus, if decisions to bet that a certain item will be recalled are described as a type of bias within the signal detection framework of metacognition, then it becomes interesting to ask whether these item-based decisions are determined by the overall difficulty of the studied materials and whether they are sensitive to changes in difficulty occurring in the course of the study phase. Similar questions can be asked with respect to, for example, restudy decisions or decisions to volunteer a candidate response in a memory report. We are currently pursuing answers to some of these questions.

Conclusions

In this chapter we have considered the impact that mapping 1 (shown in Figures 3.1 and 3.2) has on both calibration and resolution. The impact is profound. For calibration, unless the scale values that participants apply to items are taken at face value to reflect underlying subjective probability, the direct comparison of them to ascertain whether participants are UC, OC, or realistic is largely a meaningless exercise. If, instead, the assignment of scale values to internal levels of the underlying subjective dimension is malleable and dependent on such things as the experimental context, payoffs for metacognitive hits and false alarms, and personal beliefs, as countless ROC studies in other areas of experimental psychology suggest (e.g., recognition memory), then the scale values cannot be taken at face value to mean much of anything in an absolute sense. Instead, the impact of mapping 1 must be taken into consideration when considering the notion of calibration and seeking convergence of several measures (e.g., binary tasks as well as tasks involving scales). In our view, this is the way forward for calibration research. As far as resolution is concerned, mapping 1 clearly has an effect on most indices of performance and needs to be taken into account if any measure is going to be accurately interpreted. In our view, AUC with multiple criteria spread across the full range of the internal dimension occupied by the stimuli is one of the more robust measures. It also has a number of practical advantages, such as being very straightforward to compute, having a simple,

direct relationship to G, and not requiring correction factors that can distort resolution indices if the HR and/or FAR is equal to 0 or 1. We recommend that metacognitive theorists use AUC to index resolution in the future.

Notes

1. Note, however, that the criterion was well placed for the U/R distinction, which was the primary task.

2. Higham and Higham (2016) have shown that true AUC and true G (i.e., accurate values of these indices derived from population distributions) are directly related. In particular, G = 2AUC − 1. In other words, G is twice the area between the ROC curve and the chance diagonal line. Hence, there is nothing wrong with true G as a measure of resolution; it is as diagnostic as AUC. Rather, the problems that have been identified with G are due to the computational formula that is typically used to estimate it (which uses the principles of concordance and discordance and typically involves tied observations). Higham and Higham demonstrated that this distortion to G can be reduced dramatically if the notion of concordant and discordant observations is abandoned and G is computed using the trapezoidal rule instead.

References

Ariel, R., & Dunlosky, J. (2011). The sensitivity of judgment-of-learning resolution to past test performance, new learning, and forgetting. Memory & Cognition, 39, 171–184. doi:10.3758/s13421-010-0002-y

Benjamin, A. S., & Diaz, M. (2008). Measurement of relative metamnemonic accuracy. In J. Dunlosky & R. A. Bjork (Eds.), Handbook of memory and metamemory (pp. 73–94). New York, NY: Psychology Press.

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78, 1–3. doi:10.1175/1520-0493(1950)0782.0.CO;2

Bruno, D., Higham, P. A., & Perfect, T. J. (2009). Global subjective memorability and the strength-based mirror effect in recognition memory. Memory & Cognition, 37, 807–818. doi:10.1080/09658211.2010.517757

Castel, A. D., McCabe, D. P., & Roediger, H. L. (2007). Illusions of competence and overestimation of associative memory for identical items: Evidence from judgments of learning. Psychonomic Bulletin & Review, 14, 107–111. doi:10.3758/BF03194036

Dunlosky, J., & Rawson, K. A. (2012). Overconfidence produces underachievement: Inaccurate self-evaluations undermine students' learning and retention. Learning and Instruction, 22, 271–280. doi:10.1016/j.learninstruc.2011.08.003

England, B. D., & Serra, M. J. (2012). The contributions of anchoring and past-test performance to the underconfidence-with-practice effect. Psychonomic Bulletin & Review, 19, 715–722. doi:10.3758/s13423-012-0237-7

Ferrell, W. R. (1994). Discrete subjective probabilities and decision analysis. In G. Wright & P. Ayton (Eds.), Subjective probability (pp. 411–451). New York, NY: Wiley.

Ferrell, W. R. (1995). A model for realism of confidence judgments: Implications for underconfidence in sensory discrimination. Perception and Psychophysics, 57, 246–254. doi:10.3758/BF03206511


Ferrell, W. R., & McGoey, P. J. (1978). A model of calibration for subjective probabilities (Human Factors & Man-Machine Systems Laboratory Report). Tucson, AZ: University of Arizona, Systems & Industrial Engineering Dept.

Ferrell, W. R., & McGoey, P. J. (1980). A model of calibration for subjective probabilities. Organizational Behavior & Human Performance, 26, 32–53. doi:10.1016/0030-5073(80)90045-8

Ferrell, W. R., & Rehm, K. (1980). A model of subjective probabilities from small groups. In Proceedings of the Sixteenth Annual Conference on Manual Control (pp. 271–284). Cambridge, MA: MIT Press.

Finn, B., & Metcalfe, J. (2007). The role of memory for past test in the underconfidence with practice effect. Journal of Experimental Psychology: Learning, Memory, & Cognition, 33, 238–244. doi:10.1037/0278-7393.33.1.238

Finn, B., & Metcalfe, J. (2008). Judgments of learning are influenced by memory for past test. Journal of Memory and Language, 58, 19–34. doi:10.1016/j.jml.2007.03.006

Frederick, S. W., & Mochon, D. (2012). A scale distortion theory of anchoring. Journal of Experimental Psychology: General, 141, 124–133. doi:10.1037/a0024006

Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732–764. doi:10.2307/2281536

Green, D., & Moses, F. (1966). On the equivalence of two recognition measures of short-term memory. Psychological Bulletin, 66, 228–234. doi:10.1037/h0023645

Griffin, T., & Tversky, A. (1992). The weighing of evidence and the determinants of confidence. Cognitive Psychology, 24, 411–435. doi:10.1016/0010-0285(92)90013-R

Gu, H., & Wallsten, T. S. (2001). On setting response criteria for calibrated subjective probability estimates. Journal of Mathematical Psychology, 45, 551–563. doi:10.1006/jmps.2000.1337

Hanczakowski, M., Zawadzka, K., Pasek, T., & Higham, P. A. (2013). Calibration of metacognitive judgments: Insights from the underconfidence-with-practice effect. Journal of Memory and Language, 69, 429–444. doi:10.1016/j.jml.2013.05.003

Higham, P. A. (2007). No Special K! A signal detection framework for the strategic regulation of memory accuracy. Journal of Experimental Psychology: General, 136, 1–22. doi:10.1037/0096-3445.136.1.1

Higham, P. A. (2011). Accuracy discrimination and type-2 signal detection theory: Clarifications, extensions, and an analysis of bias. In P. A. Higham & J. P. Leboe (Eds.), Constructions of remembering and metacognition: Essays in honour of Bruce Whittlesea (pp. 109–127). Basingstoke, England: Palgrave Macmillan.

Higham, P. A. (2013). Regulating accuracy on university tests with the plurality option. Learning and Instruction, 24, 26–36. doi:10.1016/j.learninstruc.2012.08.001

Higham, P. A., & Higham, D. P. (2016). New improved gamma: Enhancing the accuracy of the Goodman-Kruskal gamma coefficient using signal detection theory and the trapezoidal rule. Manuscript in preparation.

Higham, P. A., Zawadzka, K., & Hanczakowski, M. (2016). Dissociations between percent-scale and binary metacognitive judgments: Binary is not always best. Manuscript in preparation.

Jang, Y., Wallsten, T. S., & Huber, D. E. (2012). A stochastic detection and retrieval model for the study of metacognition. Psychological Review, 119, 186–200. doi:10.1037/a0025960


Koriat, A. (1997). Monitoring one's own knowledge during study: A cue-utilization approach to judgments of learning. Journal of Experimental Psychology: General, 126, 349–370. doi:10.1037/0096-3445.126.4.349

Koriat, A. (2012). The self-consistency model of subjective confidence. Psychological Review, 119, 80–113. doi:10.1037/a0022171

Koriat, A. (2016, this volume). The consensuality theory of subjective confidence. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Koriat, A., Lichtenstein, S., & Fischhoff, B. (1980). Reasons for confidence. Journal of Experimental Psychology: Human Learning and Memory, 6, 107–118. doi:10.1037/0278-7393.6.2.107

Koriat, A., Sheffer, L., & Ma'ayan, H. (2002). Comparing objective and subjective learning curves: Judgments of learning exhibit increased underconfidence with practice. Journal of Experimental Psychology: General, 131, 147–162. doi:10.1037/0096-3445.131.2.147

Liberman, V. (2004). Local and global judgments of confidence. Journal of Experimental Psychology: Learning, Memory and Cognition, 30, 729–732. doi:10.1037/0278-7393.30.3.729

Lichtenstein, S., & Fischhoff, B. (1977). Do those who know more also know more about how much they know? Organizational Behavior and Human Performance, 20, 159–183. doi:10.1016/0030-5073(77)90001-0

Luna, K., Higham, P. A., & Martin-Luengo, B. (2011). Regulation of memory accuracy with multiple answers: The plurality option. Journal of Experimental Psychology: Applied, 17, 148–158. doi:10.1037/a0023276

Luna, K., & Martin-Luengo, B. (2012). Improving the accuracy of eyewitnesses in the presence of misinformation with the plurality option. Applied Cognitive Psychology, 26, 687–693. doi:10.1002/acp.2845

Macmillan, N. A., & Creelman, C. D. (2005). Detection theory: A user's guide (2nd ed.). Mahwah, NJ: Erlbaum.

Masson, M. E. J., & Rotello, C. M. (2009). Sources of bias in the Goodman–Kruskal gamma coefficient measure of association: Implications for studies of metacognitive processes. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35, 509–527. doi:10.1037/a0014876

McGillivray, S., & Castel, A. D. (2011). Betting on memory leads to metacognitive improvement by younger and older adults. Psychology and Aging, 26, 137–142. doi:10.1037/a0022681

Meeter, M., & Nelson, T. O. (2003). Multiple study trials and judgments of learning. Acta Psychologica, 113, 123–132. doi:10.1016/S0001-6918(03)00023-4

Montgomery, H. (1975). Direct estimation: Effect of methodological factors on scale type. Scandinavian Journal of Psychology, 16, 19–29.

Nelson, T. O. (1984). A comparison of current measures of the accuracy of feeling-of-knowing predictions. Psychological Bulletin, 95, 109–133. doi:10.1037/0033-2909.95.1.109

Nelson, T. O. (1986a). BASIC programs for computation of the Goodman–Kruskal gamma coefficient. Bulletin of the Psychonomic Society, 24, 281–283. doi:10.3758/BF03330141

Nelson, T. O. (1986b). ROC curves and measures of discrimination accuracy: A reply to Swets. Psychological Bulletin, 100, 128–132. doi:10.1037/0033-2909.100.1.128

Parducci, A. (1963). Range-frequency compromise in judgment. Psychological Monographs: General and Applied, 77, 1–50. doi:10.1037/h0093829

Parducci, A., & Perrett, L. F. (1971). Category rating scales: Effects of relative spacing and frequency of stimulus values. Journal of Experimental Psychology, 89, 427–452. doi:10.1037/h0031258

Pollack, I., Norman, D., & Galanter, E. (1964). An efficient nonparametric analysis of recognition memory. Psychonomic Science, 1, 327–328.

Poulton, E. C. (1979). Models for biases in judging sensory magnitude. Psychological Bulletin, 86, 777–803. doi:10.1037/0033-2909.86.4.777

Rotello, C., Masson, M., & Verde, M. (2008). Type I error rates and power analyses for single-point sensitivity measures. Perception & Psychophysics, 70, 389–401. doi:10.3758/PP.70.2.389

Scheck, P., & Nelson, T. O. (2005). Lack of pervasiveness of the underconfidence-with-practice effect: Boundary conditions and an explanation via anchoring. Journal of Experimental Psychology: General, 134, 124–128. doi:10.1037/0096-3445.134.1.124

Sieck, W. R., Merkle, E. C., & Van Zandt, T. (2007). Option fixation: A cognitive contributor to overconfidence. Organizational Behavior and Human Decision Processes, 103, 68–83. doi:10.1016/j.obhdp.2006.11.001

Smith, M., & Ferrell, W. R. (1983). The effect of base rate on calibration of subjective probability for true-false questions: Model and experiment. In P. Humphreys, O. Svenson, & A. Vari (Eds.), Analyzing and aiding decisions (pp. 469–488). Amsterdam, the Netherlands: North-Holland.

Tidwell, J., Chrabaszcz, J., & Dougherty, M. (2016, this volume). Sources of overconfidence in judgment and decision. In J. Dunlosky & S. K. Tauber (Eds.), Oxford handbook of metamemory. New York, NY: Oxford University Press.

Verde, M. F., & Rotello, C. M. (2007). Memory strength and the decision process in recognition memory. Memory & Cognition, 35, 254–262. doi:10.3758/BF03193446

Wallsten, T. S., & Gonzalez-Vallejo, C. (1994). Statement verification: A stochastic model of judgment and response. Psychological Review, 101, 490–504. doi:10.1037/0033-295X.101.3.490

Wixted, J. T. (2007). Dual-process theory and signal-detection theory of recognition memory. Psychological Review, 114, 152–176. doi:10.1037/0033-295X.114.1.152

Wixted, J. T., & Mickes, L. (2010). A continuous dual-process model of remember/know judgments. Psychological Review, 117, 1025–1054. doi:10.1037/a0020874

Wixted, J. T., & Stretch, V. (2004). In defense of the signal-detection interpretation of remember/know judgments. Psychonomic Bulletin & Review, 11, 616–641. doi:10.3758/BF03196616

Zawadzka, K., & Higham, P. A. (2015). Judgments of learning index relative confidence not subjective probability. Memory & Cognition, 43, 1168–1179. doi:10.3758/s13421-015-0532-4

Zawadzka, K., & Higham, P. A. (2016). Recalibration effects in judgments of learning. Manuscript submitted for publication.


PART 2

Metamemory Monitoring: Classical Judgments

CHAPTER 4

Judgments of Learning: Methods, Data, and Theory

Matthew G. Rhodes

Abstract Several decades of research have examined predictions of future memory performance—typically referred to as judgments of learning (JOLs). In this chapter, I first discuss the early history of research on JOLs and their fit within a leading metacognitive framework. A common methodological approach has evolved that permits the researcher to investigate the correspondence between JOLs and memory performance, as well as the degree to which JOLs distinguish between information that is or is not remembered. Factors that influence each aspect of the accuracy of JOLs are noted and considered within theoretical approaches to JOLs. Thus far, research on JOLs had yielded a number of findings and promising theoretical frameworks that will continue to be refined. Future work will benefit by considering how learners combine information to arrive at a judgment, the implications of alternative methods of measuring JOLs, and the potential for JOLs to influence memory. Key Words:  metamemory, monitoring, judgments of learning, memory predictions, self-regulation of learning

Consider the following scenario. A  student is preparing for an exam on the first century CE of the Roman Empire and is attempting to understand the vacillation between peace and conflict, exemplified by the various emperors from Caesar to Trajan. After careful review of the life and role of each emperor, the student evaluates whether learning has been sufficient so as to assure some degree of successful performance on the upcoming test. Will this prediction of future memory success prove to be accurate? What information has informed the prediction? Questions of this sort have been examined formally for nearly 50  years by soliciting predictions of future memory performance, termed judgments of learning (JOLs). In the present chapter, I review the fruits of this literature, focusing first on JOLs within a broader framework of metacognition and the earliest research on predictions of future memory performance. Next, I consider typical methods of soliciting JOLs and the general accuracy of JOLs,

highlighting variables that do or do not influence judgment. These findings are considered within the key theoretical frameworks that have been offered to explain how individuals arrive at a prediction of future memory performance. Finally, I  discuss emerging issues that may characterize future research on JOLs.

Judgments of Learning within the Nelson and Narens (1990) Monitoring and Control Framework

In their classic framework for considering metacognition (i.e., our awareness of our own cognition), Nelson and Narens (1990; see also Nelson, 1996; Thiede, Dunlosky, & Mueller, this volume) distinguished between processes related to assessing one's learning (monitoring) and the self-regulation of learning based on information gained from monitoring (control). For our example student, reflection on whether information had been mastered would comprise monitoring with control processes

reflected by a decision on whether to engage in further study of an emperor. The link between monitoring and control is a fundamental assumption of any research in metacognition. Indeed, monitoring appears to influence control over learning even when at odds with objective indices of learning (Metcalfe & Finn, 2008; Rhodes & Castel, 2009). However, although individuals may be imperfect at monitoring the contents of cognition, "A system that monitors itself (even imperfectly) may use its own introspections as input to alter the system's behavior" (Nelson & Narens, 1990, p. 128, italics in original). Accordingly, JOLs constitute one type of introspective judgment that provides input to a metacognitive system that may alter behavior. Within the Nelson and Narens (1990) framework, JOLs were listed as one of several prospective monitoring judgments possible in anticipation of future memory performance. Whereas JOLs occur during information acquisition (i.e., encoding) or even during storage, other prospective judgments take place prior to encoding (e.g., ease of learning judgments) or following unsuccessful retrieval (e.g., feeling-of-knowing judgments). These prospective judgments, considering the future state of memory, may be contrasted with retrospective judgments that occur after a memory has been retrieved and involve reflecting on the current products of memory. The most common example is the retrospective confidence judgment, whereby individuals indicate their confidence that some bit of retrieved information is accurate. Although prospective judgments are considered distinct from retrospective judgments, relatively little research has compared the different judgments under the same encoding and retrieval conditions. The work that has been done suggests that prospective judgments rely to some extent on qualitatively different information than retrospective judgments (Dougherty, Scheck, Nelson, & Narens, 2005; see also Busey, Tunnicliff, Loftus, & Loftus, 2000). As well, different types of prospective judgments may rely on different sources of information (but see Dunlosky & Tauber, 2013). For example, Leonesio and Nelson (1990) reported that ease of learning judgments were less accurate predictors of future memory performance than JOLs.

A Brief History of Early Research on Judgments of Learning

Much like other modern studies of metacognition (e.g., Hart, 1965; see chapter 1 of this volume), systematic research on JOLs began only within the past 50 years, and was initially pursued sporadically,

with gaps of years sometimes occurring between publications. Arbuckle and Cuddy (1969) reported the first investigation of predictions of future memory performance made during encoding (i.e., JOLs). Their interest was in whether participants could identify differences in the associative strength of two items and thus make memory predictions that were consistent with associative strength. Accordingly, in two experiments, participants studied sets of paired associates (word-number pairs or word-word pairs) and either made a yes/no judgment of whether a target would be recalled (Experiment 1)  or made predictions on a five-point Likert scale (Experiment 2). The results from both experiments showed that predictions were generally accurate, at the very least exceeding a criterion of chance performance. For example, items given a “yes” prediction were more likely to be recalled than items given a “no” prediction, with “yes” predictions more likely for strongly than weakly related pairs. Likewise, items rated as “very likely” to be recalled were recalled more frequently than items rated as “unlikely” to be recalled. Arbuckle and Cuddy (1969) concluded their paper with an optimistic assessment of the accuracy of memory predictions and suggested an agenda for future research, including exploring factors that might influence predictions even when memory performance was unaffected. Their suggestions for future work went largely unheeded at the time. Instead, memory predictions were explored only intermittently and often confined to a burgeoning line of work examining metacognition in children spearheaded by John Flavell (Flavell, 1979). For example, Flavell, Friedrichs, and Hoyt (1970) used a variant of the JOL procedure with young children who saw pictures of objects and were asked to predict the maximum number of pictures that could be perfectly recalled. In general, such memory span predictions exceeded performance, with children of kindergarten age or younger more likely to exhibit overconfidence (see also Levin, Yussen, Pressley, & de Rose, 1977; Yussen & Levy, 1975). These experiments with children captured only one aspect of JOLs, a general prediction of overall performance based on exposure to a subset of material, but not online predictions made during learning. Indeed, Groninger (1976) reported the first investigation after Arbuckle and Cuddy (1969) to solicit predictions of memory performance during learning. Groninger’s (1976; see also Groninger, 1979) participants studied a list composed of several classes of words (concrete, abstract, nonsense, emotional) and

made confidence judgments regarding the likelihood of later recognition for each word. In general, participants were more likely to subsequently recognize items given high confidence judgments during learning, and judgments were reported to be sensitive to the type of items studied, with participants most confident of remembering emotional items. The decade from 1980 to 1990 similarly yielded few papers on JOLs, but was characterized by some significant contributions. King, Zechmeister, and Shaughnessy (1980) had participants make JOLs for paired associates presented multiple times across several blocks. JOLs were made either in the context of multiple study opportunities or amidst alternating study and test opportunities. Testing enhanced performance on a final test of recall and also yielded more accurate predictions than studying alone. Specifically, King et  al. (1980) demonstrated that JOLs were markedly higher for previously tested items that had been recalled compared with items that had not been recalled (see also Lovelace, 1984), presaging later accounts of how participants arrive at memory predictions during multi-trial learning (e.g., Finn & Metcalfe, 2007). Nevertheless, by 1990, the first two decades of work on JOLs had yielded some important insights (see also Mazzoni, Cornoldi, & Marchitelli, 1990; Vesonder & Voss, 1985) but few publications and data. Indeed, in their comprehensive review and theoretical treatise on metacognition, Nelson and Narens (1990) cite only three published papers that collected data on JOLs. However, this was merely a prelude to an explosion of work on JOLs that is the primary focus of the remainder of the chapter.

Methodology and Notable Findings

The methodology for soliciting JOLs has changed minimally since Arbuckle and Cuddy’s (1969) original paper. For example, in a typical experiment, participants are presented with memoranda, such as paired associates (e.g., Table-Spoon), one at a time. Either immediately after the presentation of a pair or at some delay, participants are prompted to make a JOL of the likelihood of later recalling the target (e.g., Spoon), often cued by the stimulus (e.g., Table-?). Finally, after making JOLs for each item, participants are given a memory test for the studied items.1 Several variants on this approach have been reported. For example, Nelson, Narens, and Dunlosky (2004) created the prejudgment recall and monitoring (PRAM) procedure to assess the contents of memory that inform predictions by having participants engage in retrieval just prior to

providing a JOL. Also, Castel (2008) introduced the prestudy JOL procedure, whereby participants are given information about the nature of a study item (e.g., its serial position within a list) and make their JOL prior to actually viewing this item. Unlike Arbuckle and Cuddy (1969), who solicited judgments on a 5-point Likert scale, most recent work on JOLs has asked participants to make their judgment as a probability, or percentage likelihood, of future successful memory performance2 (but see the Pending Issues and Future Directions section for other judgments). Using a percentage scale permits investigators to calculate measures of absolute accuracy (i.e., calibration), the overall correspondence between judgment and performance (see Higham, Zawadzka, & Hanczakowski, this volume, for additional details). Consider a participant who has studied 10 faces and made a JOL for each face prior to a recognition test. If the average JOL was 70%, then absolute accuracy would be perfect if 70% of the studied faces were subsequently recognized. Judgment would be characterized by overconfidence if JOLs exceeded performance and underconfidence if performance exceeded JOL magnitude. There is some debate as to whether participants can faithfully translate memory predictions to a probability scale (e.g., Hanczakowski, Zawadzka, Pasek, & Higham, 2013), but this approach remains the most common given that the judgment (probability of recall/recognition) can be made on the same scale as the measure of memory. Whereas absolute accuracy refers to the overall correspondence between judgment and performance, relative accuracy (i.e., resolution) refers to the degree to which an individual's JOLs distinguish between what is and what is not remembered. Relative accuracy is typically measured via the Goodman-Kruskal gamma correlation, a nonparametric index of association that ranges from –1.0 to +1.0 and quantifies the association between JOLs and memory performance (Nelson, 1984; Gonzalez & Nelson, 1996). To the degree that subsequently remembered items are given high JOLs and items that are less likely to be remembered are given lower JOLs, gamma will be positive. Likewise, to the degree that subsequently remembered items are given low JOLs and items that are less likely to be remembered are given higher JOLs, gamma will be negative. Gamma correlations at or approaching zero suggest that there is no relationship between JOLs and subsequent memory performance. Accordingly, the ideal learner's memory predictions would be considered accurate to the extent

that JOL magnitude matched memory performance and to the extent that JOLs discriminated between information that was or was not remembered. Given that absolute and relative accuracy reflect different components of accuracy, major findings should be considered in light of those measures (Koriat, 1997).
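Both components of accuracy are straightforward to compute from a set of JOLs and their corresponding recall outcomes. The short Python sketch below, using invented data, computes a signed calibration score (mean JOL minus the percentage of items recalled, so positive values indicate overconfidence) and the gamma correlation from concordant and discordant item pairs; the data and variable names are hypothetical rather than drawn from any study discussed here.

from itertools import combinations

# Hypothetical data: one JOL (0-100%) and one recall outcome (1 = recalled,
# 0 = not recalled) per studied item.
jols = [90, 80, 70, 60, 50, 40, 30, 20]
recall = [1, 1, 0, 1, 0, 0, 1, 0]

# Absolute accuracy (calibration): positive values indicate overconfidence.
bias = sum(jols) / len(jols) - 100 * sum(recall) / len(recall)

# Relative accuracy (resolution): gamma over all item pairs; pairs tied on
# either variable count as neither concordant nor discordant.
concordant = discordant = 0
for (j1, r1), (j2, r2) in combinations(zip(jols, recall), 2):
    if (j1 - j2) * (r1 - r2) > 0:
        concordant += 1
    elif (j1 - j2) * (r1 - r2) < 0:
        discordant += 1
gamma = (concordant - discordant) / (concordant + discordant)

print(f"mean JOL - percentage recalled = {bias:+.1f}; gamma = {gamma:.2f}")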

Factors That Influence the Absolute Accuracy of Judgments of Learning

Absolute accuracy is affected if a given factor influences (a) the magnitude of JOLs and/or (b) the likelihood that target information is remembered. An exhaustive account of all factors that affect absolute accuracy is beyond the scope of this review (see Schwartz & Efklides, 2012, for a partial list). However, I consider influences on JOL magnitude in terms of factors that characterize the item or information to be remembered (e.g., association strength, concreteness, etc.) and factors that characterize the conditions of study and testing (e.g., presentation rate, encoding operations, test format, retention interval, etc.).3

Item-Based Influences on JOL Magnitude.

The original work on JOLs (Arbuckle & Cuddy, 1969) demonstrated that learners deemed related pairs of items (e.g., Table-Chair) to be more memorable than unrelated pairs of items (e.g., Horse-Rugby). The increase in JOL magnitude for related items has been replicated consistently (e.g., Castel, McCabe, & Roediger, 2007; Koriat, 1997) and is evident regardless of whether relatedness is manipulated between or within subjects (Dunlosky & Matvey, 2001). Relatedness appears to have such a substantial influence on JOLs that participants regard related items to be more memorable than unrelated items even when the opposite is true. For example, Carroll, Nelson, and Kirwan (1997) had participants overlearn unrelated pairs (studied to a criterion of eight correct recalls) relative to related pairs (studied to a criterion of two correct recalls). When tested after a 2- or 6-week retention interval, cued recall was superior for the unrelated compared with the related pairs. However, despite memory performance favoring unrelated items, JOLs were far greater for related items. Indeed, at the 6-week interval, JOLs were vastly more overconfident with respect to memory performance for related compared with unrelated pairs. Koriat and Bjork (2005, 2006a) have further shown that participants' JOLs are sensitive to relatedness even under circumstances that undermine memory. More subtle variations, such as the potential set of items related to a target, have less pronounced effects on JOLs (e.g., Eakin & Hertzog, 2012). While such manipulations rely on the associative strength between items, JOL magnitude is also influenced by a variety of factors that are inherent to an item and not contingent on its relationship with other information. For example, participants provide higher JOLs for concrete (donkey) relative to abstract (truth) items, consistent with memory performance (Tauber & Rhodes, 2012a; see also Hertzog, Dunlosky, Robinson, & Kidder, 2003). Likewise, JOLs are generally higher for emotional words (e.g., Groninger, 1976; Tauber & Dunlosky, 2012; Zimmerman & Kelley, 2010) and faces (Nomi, Rhodes, & Cleary, 2013). Indeed, emotion appears to elevate JOLs independently of memory performance under some circumstances. For example, Zimmerman and Kelley (2010) had participants study pairs with negative target words (e.g., prison-cancer), positive target words (e.g., table-fame) and neutral target words (e.g., violin-avenue). Participants made a JOL for each item, predicting the likelihood of recalling the target (e.g., avenue) given the cue word (e.g., violin) and later attempted to recall the target given the cue. JOLs were similar for positive and negative word pairs and exceeded those for neutral word pairs. However, whereas JOLs for positive word pairs accurately predicted memory performance, JOLs for negative words were characterized by high levels of overconfidence (but see Tauber & Dunlosky, 2012, and Zimmerman and Kelley, 2010, for different results with free recall).

Other research has demonstrated that changes in the appearance of to-be-remembered information may have a significant influence on memory predictions (Busey et al., 2000; Rhodes & Castel, 2008; Rhodes & Castel, 2009; Sungkhasettee, Friedman, & Castel, 2011; Yue, Castel, & Bjork, 2013). For example, Rhodes and Castel (2008) had participants study and make JOLs for a list of words that varied in the size of the type used to present each item. Namely, half of the words were presented in a large type size (48 pt) and half were presented in a smaller type size (18 pt). The results are shown in Figure 4.1. Overall, participants consistently provided higher JOLs for large compared with small words, although no differences in memory performance were detected. Thus, large words resulted in greater levels of overconfidence regarding recall than small words. Rhodes and Castel (2009) reported an auditory analog to this finding, demonstrating that loud words were given higher JOLs than quieter

words, even when memory performance was essentially identical. One method of linking these findings is to suggest that loud or large words are more easily perceived (i.e., more fluent) and that more fluent materials are thus accorded higher JOLs. However, this account is contradicted by findings that clarity does not always engender higher JOLs (Yue et al., 2013). Regardless, it is apparent that JOLs may be driven by factors that are specific to an item.

Figure 4.1  Mean JOLs (black bars) and mean percentage recalled (white bars) by participants exposed to small (18 pt) and large (48 pt) words (adapted from Rhodes & Castel, 2008). Participants provided significantly higher JOLs for large compared with small words. However, there was no difference in the percentage of words recalled as a function of the size of the word. Note: JOL = judgment of learning.

Encoding and Retrieval Influences on JOL Magnitude.

Ideally, an individual’s memory predictions should be sensitive to conditions that bear on memory performance and largely immune to conditions that have minimal effects on memory. That is, if condition X is a boon to memory and condition Y a bane to memory, then JOLs should conform to this pattern, with the magnitude of JOLs for condition X exceeding those for condition Y. In general, the literature suggests a modest concordance with this ideal pattern but is also characterized by a number of instances in which individuals’ JOLs appear insensitive or inadequately sensitive to factors that have considerable effects on memory. As a case example of this mixture of acuity and insensitivity to conditions that influence memory performance, consider work by Shaw and Craik (1989). They had participants study nouns paired with one of three types of cues:  a letter cue (e.g., “starts with ic:  ice”), a rhyme cue (e.g., “rhymes with dice: ice”), or a category/descriptive cue (e.g., “something slippery:  ice”). After studying each item, participants provided a JOL of the likelihood

of later remembering the target when it was paired with the studied cue (e.g., "rhymes with dice: ?"). Several decades of work on depth of processing have indicated that the degree to which learners consider meaning during encoding is positively related to retention (Craik & Lockhart, 1972). Shaw and Craik's (1989) results were no different, as cued recall was reliably better for items associated with category information than rhyming information, which in turn resulted in superior recall relative to items cued based on letters. By extension, participants' memory predictions should, ideally, reflect this relationship, with JOL magnitude greatest for items considered in terms of categorical information and lowest for items considered in terms of orthographic information. To some extent this pattern of memory predictions was evident: JOLs were greater for category and rhyming items than items cued by the specification of letters. However, memory predictions did not differ for category versus rhyming items, despite a robust memory advantage for category items. Thus, participants exhibited partial sensitivity to conditions that influenced encoding (but see Bieman-Copland & Charness, 1994; Dunlosky & Nelson, 1994). The broader literature is likewise a boom-and-bust cycle, indicating that, with some notable exceptions, the magnitude of participants' JOLs often fails to distinguish between effective versus ineffective encoding activities. For example, two of the most powerful methods of enhancing memory are spacing and testing. Spacing (see Cepeda, Pashler, Vul, Wixted, & Rohrer, 2006, for a review) refers to the memorial advantage that accrues when information

is studied multiple times, but not consecutively (i.e., presentations of the same item are separated by at least one other item), compared with massing information (i.e., consecutive presentations of the same item). The testing effect (see Roediger & Butler, 2011, for a review) refers to the finding that retention is superior when learners engage in some form of retrieval of previously studied information relative to simply restudying (rereading) this information. Both methods are highly effective mnemonic strategies but JOL magnitude does not consistently or adequately reflect these benefits. For example, Logan, Castel, Haber, and Viehman (2012; see also Zechmeister & Shaughnessy, 1980) had participants study a list of words in which some words were repeated immediately (massed study) and some were repeated after a lag of at least three items (spaced study). Participants made a JOL after each presentation and were later administered a free recall test. Logan et al. (2012) observed that participants' JOLs were marginally higher for spaced compared with massed items. However, this difference in JOLs (approximately 2 percentage points) was dwarfed by the memory advantage apparent for spaced items (approximately 16 percentage points). Indeed, participants' JOLs generally underestimated the robust benefits of spacing. A similar pattern is apparent for studies of the testing effect, whereby participants provide comparable JOLs for tested compared with restudied information despite testing benefits (King et al., 1980; Kornell & Rhodes, 2013; Roediger & Karpicke, 2006; but see also Tullis, Finley, & Benjamin, 2013). Other work has also demonstrated that the magnitude of participants' memory predictions may ignore factors that have a significant influence on memory. For example, participants' JOLs often fail to account for the ameliorative effect of additional study opportunities after an initial test (Kornell & Bjork, 2009; Kornell, Rhodes, Castel, & Tauber, 2011) and may not distinguish between long versus short retention intervals (Koriat, Bjork, Sheffer, & Bar, 2004), particularly if the interval is manipulated between subjects. However, it would be a gross mischaracterization to suggest that participants are uniformly oblivious to factors that enhance or hinder memory. For example, the magnitude of participants' JOLs differentiates between recognition versus recall tests (Groninger, 1979; Thiede, 1996; Thiede & Dunlosky, 1994) and is sensitive to variations in the number of items to be learned (Tauber & Rhodes, 2010a). Participants' JOLs also often

favor effective study techniques such as generation (e.g., Begg, Vinski, Frankovich, & Holgate, 1991; Castel, Rhodes, & Friedman, 2013) or interactive imagery (Dunlosky & Nelson, 1994) over less effective techniques. More important, JOL magnitude may be altered by the updated knowledge that accumulates from a recent prior learning experience (Bieman-Copland & Charness, 1994; Castel, 2008; Hertzog & Dunlosky, 2000; Koriat & Bjork, 2006a; Tauber & Rhodes, 2010b). For example, Castel (2008) observed that participants’ JOLs closely corresponded with the effects of order (i.e., serial position) on recall following practice, coupled with salient information about order during encoding. Likewise, Tauber and Rhodes (2010b) reported that participants were better able to predict differences in memory for names versus occupations after a specific prior experience with learning names. Such practice does not guarantee that JOL magnitude will be adjusted appropriately (see e.g., Koriat, Sheffer, & Ma’ayan, 2002; Logan et al., 2012) but suggests that participants can update their knowledge of factors that influence memory under some circumstances.

Factors That Influence the Relative Accuracy of Judgments of Learning

As noted previously, relative accuracy refers to the degree to which an individual's JOLs distinguish between information that will or will not be remembered. In other words, relative accuracy describes the rank-ordering of JOLs with respect to memory performance. Accordingly, the ideal learner is presumed to provide higher JOLs for remembered information compared with information that will not be remembered. There remains some controversy over Nelson's (1984) proposal that the Goodman-Kruskal gamma correlation is the appropriate measure of relative accuracy (see, e.g., Benjamin & Diaz, 2008; Masson & Rotello, 2009, for criticisms and alternative measures). However, given its predominance in research on JOLs, I discuss relative accuracy in reference to the ubiquitous gamma correlation. In particular, three factors with a common mechanism have a substantial influence on the relative accuracy of memory predictions.

Testing.

In addition to being a potent method of enhancing retention, the act of retrieval also produces high levels of relative accuracy. King and colleagues (1980) first documented the benefits of testing for memory predictions, reporting that

JOLs were far more accurate when participants experienced some test trials during learning compared with conditions that only involved studying. Subsequent work has confirmed that testing (e.g., Kornell & Rhodes, 2013) or conditions that foster retrieval (e.g., Dunlosky & Nelson, 1992; see JOL Timing for additional details) provide a significant boost to relative accuracy compared with conditions that only require individuals to review materials or discourage testing. One account of this finding is that participants use retrieval success as an index of future memory performance, allocating high JOLs to retrieved items and low JOLs to items that were not retrieved (Finn & Metcalfe, 2007; King et al., 1980). Given that current retrieval will often remain stable on a future test (but see Schmidt & Bjork, 1992), retrieval success leads to highly accurate predictions (cf. Dougherty et al., 2005).

Practice.

Relative accuracy appears to improve markedly when the same information is presented across multiple study-test cycles. The most striking example of this is the underconfidence-with-practice (UWP) effect, characterized by decrements to absolute accuracy alongside benefits to relative accuracy (Koriat et al., 2002). Specifically, across repeated study-test cycles with the same information, participants often exhibit overconfidence (i.e., memory predictions exceed performance) on a first study-test trial followed by underconfidence (i.e., memory performance exceeds predictions) on later study-test trials. In contrast, relative accuracy generally increases across trials. Tauber and Rhodes (2012b) report data emblematic of this pattern. They had participants make JOLs and receive a memory test for pairs of unrelated words in three study-test trials. Recall increased across trials, going from a mean of approximately 35% on Trial 1 to 72% by Trial 2, and 83% by Trial 3. JOLs initially exceeded memory performance on Trial 1 (50%) and then were underconfident for Trials 2 (58%) and 3 (76%). However, relative accuracy increased steadily from Trial 1 (Gamma = 0.23) to Trial 2 (Gamma = 0.64) to Trial 3 (Gamma = 0.82). Such benefits to relative accuracy as a function of practice have been widely observed (e.g., Finn & Metcalfe, 2007, 2008; Koriat, 1997; Koriat et al., 2002; Koriat, Ma'ayan, Sheffer, & Bjork, 2006). Much like explanations of the metacognitive advantages of testing, the predominant account of this finding is that participants use success or failure of retrieval as a basis for JOLs (Finn & Metcalfe, 2007, 2008). For example, during a second study opportunity, low JOLs are likely to be assigned to items identified as forgotten on the first test, whereas higher JOLs are likely to be assigned to items identified as recalled on the first test. This will generally beget JOLs that are high in relative accuracy but may also underestimate new learning that occurs (e.g., some previously forgotten information may now be learned), leading to the underconfidence evident on later study-test cycles. Thus, although other factors certainly inform multi-trial learning (Ariel & Dunlosky, 2011; England & Serra, 2012; Tauber & Rhodes, 2012b), retrieval is the predominant factor driving judgment accuracy. This also suggests an important caveat when considering the influence of practice on relative accuracy. Namely, when each study-test cycle contains new information, gamma correlations appear to remain largely consistent and do not improve across trials (e.g., Koriat & Bjork, 2006b; Matvey, Dunlosky, Shaw, Parks, & Hertzog, 2002; Tauber & Rhodes, 2010a). Given that introducing new items in each study-test cycle entails that participants cannot rely on past retrieval success or failure, their JOLs are less effective at differentiating items that will or will not be remembered. As a consequence, practice has little impact on relative accuracy under these circumstances.
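The dissociation in the Tauber and Rhodes (2012b) data can be made concrete with a small calculation over the approximate means quoted above, taking calibration bias as mean JOL minus mean recall:

```python
# Calibration (bias) vs. resolution (gamma) across study-test trials,
# using the approximate Tauber and Rhodes (2012b) means quoted above.
trials = {1: (50, 35, 0.23), 2: (58, 72, 0.64), 3: (76, 83, 0.82)}

for trial, (mean_jol, mean_recall, g) in trials.items():
    bias = mean_jol - mean_recall  # positive = overconfidence
    label = "overconfident" if bias > 0 else "underconfident"
    print(f"Trial {trial}: bias = {bias:+d}% ({label}), gamma = {g}")

# Trial 1: bias = +15% (overconfident),  gamma = 0.23
# Trial 2: bias = -14% (underconfident), gamma = 0.64
# Trial 3: bias = -7%  (underconfident), gamma = 0.82
```

Absolute accuracy flips from over- to underconfidence even as relative accuracy climbs, which is the UWP signature.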

Timing.

Following the procedure outlined by Arbuckle and Cuddy (1969), the first two decades of research on JOLs generally involved soliciting judgments immediately after an item was studied. However, Nelson and Dunlosky (1991; see also Begg, Duft, Lalonde, Melnick, & Sanvito, 1989) reported that varying the timing of JOLs had a dramatic influence on relative accuracy. They had participants study a list of 66 unrelated paired associates (e.g., Table-Spoon). For half of these items, JOLs were made immediately after an item's presentation, whereas for the remaining half of the items JOLs were delayed such that at least ten items intervened prior to making a JOL. Following this study phase participants received a cued recall test. Delaying JOLs led to a striking pattern, with gamma correlations far greater for delayed JOLs (Gamma = 0.90) compared with immediate JOLs (Gamma = 0.38). The benefit of delaying JOLs was so substantial that Nelson and Dunlosky (1991) noted that, "Every subject's accuracy on delayed JOL was greater than the mean of those same subjects' accuracy on immediate JOL" (p. 269). Nelson and Dunlosky coined their finding the delayed JOL effect. Subsequent work has generally confirmed the benefits of delaying judgment (e.g., Dunlosky & Nelson, 1992, 1994, 1997; Kelemen & Weaver, 1997; Koriat & Ma'ayan, 2005; Nelson et al., 2004). For example, in their meta-analysis of 45 studies (112 effect sizes) of the delayed JOL effect, Rhodes and Tauber (2011a) observed that delaying JOLs produced a nearly one standard deviation (g = 0.93) increase in gamma correlations compared with immediate JOLs. The predominant account of the delayed JOL effect suggests that delaying judgment encourages participants to attempt retrieval from long-term memory, with that information informing judgment (Nelson & Dunlosky, 1991; see also Dunlosky & Nelson, 1992, 1994; Nelson et al., 2004), akin to the previously discussed benefits of testing for relative accuracy. Accordingly, delayed JOLs are guided by the products of retrieval (Rhodes & Tauber, 2011b). In contrast, immediate JOLs are presumed to reflect information accessible in memory right after study that will be less diagnostic of future memory performance. Dunlosky and Nelson (1992) report compelling evidence regarding the importance of retrieval for the delayed JOL effect. They varied the cue used to solicit JOLs such that half of their participants provided JOLs solicited via a cue and target (e.g., Table-Spoon) and half provided JOLs prompted with the cue alone (e.g., Table-?). As can be seen in Figure 4.2, a robust delayed JOL advantage was evident for JOLs solicited by the cue alone but not with the cue and target. This presumably occurred because soliciting a JOL with the cue and target eliminates the opportunity to interrogate long-term memory and thus robs the learner of diagnostic information. It should be noted that other theories dispute this account. For example, an alternative perspective is that when participants successfully retrieve a target during a delayed JOL they will likely ascribe a higher JOL to that item than to items that were not retrieved. Because successful retrieval provides a boost to memory, the act of making a delayed JOL enhances memory and ensures that the high JOL is accurate (Spellman & Bjork, 1992; see the section Influence of Judgment on Memory for additional details). Nevertheless, regardless of the explanation, delaying judgment appears to be one of the most robust methods of enhancing relative accuracy.
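The retrieval account lends itself to a toy simulation (my own, not from the chapter). Assume each item has a latent strength; a delayed JOL tracks the outcome of a covert retrieval attempt that succeeds with probability equal to that strength, whereas an immediate JOL tracks transient post-study accessibility that is uniformly high and only weakly tied to strength. All parameter values are invented.

```python
# Toy simulation of the retrieval account of the delayed-JOL effect.
# Reuses the gamma() function sketched earlier.
import random
random.seed(1)

immediate, delayed, recalled = [], [], []
for _ in range(1000):
    strength = random.random()                   # latent learning of the item
    # Immediate JOL: everything feels accessible right after study.
    immediate.append(70 + 10 * strength + random.gauss(0, 15))
    # Delayed JOL: driven by whether a covert retrieval attempt succeeds.
    delayed.append(80 if random.random() < strength else 20)
    # Final recall depends on the same latent strength.
    recalled.append(1 if random.random() < strength else 0)

print(gamma(immediate, recalled))  # low (~0.2): weakly diagnostic cue
print(gamma(delayed, recalled))    # much higher (~0.6): products of retrieval
```

The simulation reproduces the ordering of the effect, not the exact gammas reported by Nelson and Dunlosky (1991).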

Theoretical Accounts of the Bases of Judgments of Learning

As has been documented, a vast array of data has been collected on the factors that do or do not influence JOLs. But an overriding question remains: How do individuals make JOLs? Several theoretical approaches have been offered in answer to this question. These approaches can be broadly classified into two categories: direct access accounts and inferential accounts (King et al., 1980; Koriat, 1997). The distinction reflects the amount and type of information the learner has available to make predictions.

Figure 4.2  Median gamma correlations for immediate (black bars) and delayed (white bars) JOLs solicited by the cue alone (e.g., Table-?) and the cue and target (e.g., Table-Spoon). The timing of JOLs did not influence relative accuracy when JOLs were solicited via the cue and target. In contrast, relative accuracy was much greater for delayed compared with immediate JOLs when solicited via the cue alone (adapted from Dunlosky & Nelson, 1992). Note: JOL = judgment of learning.


Direct Access Accounts

Arbuckle and Cuddy (1969) conducted their original experiment on JOLs under the premise that if "items differed in associative strengths immediately following presentation, [participants] should be able to detect these differences just as they can detect differences in strength of any other form of input signal" (p. 126). That is, participants should be able to make judgments by assessing the strength of a memory trace. A corollary of this account is that individuals have privileged access to the contents of memory that allows them to directly assess the efficacy of encoding and differentiate between items that have or have not been well learned. Accordingly, direct access accounts propose that JOLs are a translation of the very contents (i.e., strength) of memory onto the scale used for judgment (Cohen, Sandler, & Keglevich, 1991). The direct access view appears to have been met with skepticism even among early investigators. For example, King et al. (1980) noted that "Although this hypothesis [direct access] is intuitively appealing, we know of no evidence providing direct support for the hypothesis" (p. 340). King et al. did not articulate what standard of evidence would be necessary to evaluate a direct-access hypothesis, but a means of evaluation was suggested by Koriat (1997). In particular, Koriat notes that a direct-access hypothesis predicts a strong correspondence between memory and JOLs because both are putatively based on the same factor (trace strength). By this standard, it is difficult to sustain a direct-access account of JOLs given the sheer number of demonstrations of stark discrepancies between JOLs and memory performance (see e.g., the summary of research on absolute accuracy). Moreover, any correspondence between JOLs and memory performance is just as amenable to alternative explanations (e.g., individuals' knowledge about memory) as to an account predicated on access to the strength of memory traces. Thus, despite several sophisticated attempts at understanding the interplay of memory strength and JOLs (e.g., Jang & Nelson, 2005), direct access accounts have rarely been favored.

Inferential Accounts

Rather than reflecting direct access to the contents of memory, inferential accounts hold that JOLs are based on a variety of information available during learning. This could include the type of information being studied (e.g., related vs. unrelated items), whether an item was recalled previously, the type of test to be administered, the rate at which information is presented, and so on. Koriat (1997) has proposed a cue-utilization framework to organize the many cues that might influence JOLs. His framework starts with the assumption that "JOLs are based on the implicit application of rules or heuristics in order to achieve a reasonable assessment of the probability that the information in question will be recalled or recognized at some later time" (p. 350). As such, "JOLs are accurate as long as the cues used at the time of making the judgments are consistent with the factors that affect subsequent performance on the criterion memory test" (p. 350). Thus, JOLs are inferences made on the basis of the cues available during learning. Koriat (1997) has specified three classes of cues that inform JOLs. Intrinsic cues refer to characteristics of the items to be learned that may influence (or are deemed to influence) learning. This includes perceptual characteristics of the items (e.g., size, clarity), associative relatedness, concreteness, emotional qualities of the stimuli, and essentially any other characteristics inherent to the item. Extrinsic cues refer to the conditions of encoding or testing and the processes applied by the learner. Examples include presentation rate (Groninger, 1976), spacing vs. massing study (Logan et al., 2012), recall vs. recognition tests (Thiede & Dunlosky, 1994), or the depth of processing applied to study items (Shaw & Craik, 1989; Dunlosky & Nelson, 1994). Whereas intrinsic and extrinsic cues refer to information external to the learner (e.g., the type of item and type of processing), mnemonic cues refer to internal indices that reflect the learner's memorial experience of an item. This might include how easily an answer comes to mind in response to a cue (Benjamin, Bjork, & Schwartz, 1998), memory for a prior test (e.g., Finn & Metcalfe, 2007), and the familiarity of a cue (Maki, 1999), among many other possibilities. The cue-utilization framework can readily explain the many discrepancies between JOLs and memory performance that have been documented (e.g., Benjamin et al., 1998; Carroll et al., 1997; Kornell & Bjork, 2009; Rhodes & Castel, 2008). From this view, discrepancies arise because the cues used by the learner to inform JOLs are unrelated to actual memory performance. As illustrative of this approach, consider findings reported by Benjamin et al. (1998). They had participants answer a series of general knowledge questions (e.g., What is the largest desert on earth?). Benjamin et al. indexed the difficulty of each question as the latency of indicating that an answer was available, grouping each participant's responses into quartiles based on the speed of responding. After answering the question, participants made a JOL of the likelihood of successfully engaging in free recall of the answer on a later test. The results are shown in Figure 4.3. As can be seen, the answers retrieved most quickly (Quartile 1) were given the highest JOLs; however, the opposite pattern was apparent for recall, as items with the longest latencies (Quartile 4) were most likely to be recalled. Thus, participants used a mnemonic cue (retrieval latency) as a basis for JOLs but inferred the wrong relationship between retrieval latency and free recall success. That is, although retrieval latency is likely to be a strong marker of success when the original question can again serve as a cue, it is a poor basis for JOLs when that cue will be unavailable, as on a free recall test. Such a framework is appealing in its explanatory power: Any pattern of JOLs can be explained as a match or mismatch of the cues available and the factors that drive memory. However, its status as a "framework" necessarily calls for greater precision in delineating and testing those cues that inform JOLs (see Dunlosky & Matvey, 2001). Moreover, the mechanism of influence of any single cue is often ambiguous. For example, as described previously, Rhodes and Castel (2008) reported that participants deemed large words to be more memorable than small words (see Figure 4.1).

They noted that one candidate explanation is that participants perceived larger words to be processed more easily (more fluently) than small words and thus misattributed this ease of processing to indicate a greater ease of retrieval on a later test. Thus, one might propose that a mnemonic cue (experienced ease of processing) drove judgment. However, there is a viable alternative. Specifically, participants may have a general belief that larger items are easier to remember than small items and applied that belief when making JOLs (Mueller, Dunlosky, Tauber, & Rhodes, 2014). This distinction captures a second important element of an inferential account. That is, in addition to using a variety of available cues, participants' JOLs may rely on their own mnemonic experience while processing an item (experience-based judgments) or use their general knowledge about learning and memory (theory-based judgments) to inform judgment (Kelley & Jacoby, 1996;4 Koriat, 1997; Koriat et al., 2004). With regard to experience-based judgments, a number of researchers have suggested ease of processing at encoding or retrieval as a central cue that influences JOLs (Begg et al., 1989; Benjamin et al., 1998; Hertzog, Dunlosky, Robinson, & Kidder, 2003; Koriat & Ma'ayan, 2005; Rhodes & Castel, 2008; Undorf & Erdfelder, 2011). For example, Hertzog et al. (2003) reported that JOLs were negatively related to the latency to form an image during encoding (i.e., higher JOLs were accorded items encoded quickly), whereas the ease of forming an image was unrelated to recall.

Figure 4.3  Mean JOLs (black bars) and the mean percentage of previous answers recalled (white bars) as a function of response time quartile (adapted from Benjamin, Bjork, & Schwartz, 1998). Latencies to answer each question during the study phase were grouped into quartiles ranging from the fastest responses (Quartile 1) to the slowest responses (Quartile 4). The magnitude of JOLs was negatively related to response time, such that the highest JOLs were provided to the questions answered most quickly (Quartile 1). In contrast, the percentage of answers recalled was positively related to response latencies, such that the highest levels of recall were evident for the slowest quartile (Quartile 4). Note: JOL = judgment of learning.


Likewise, Benjamin et al. (1998) observed a negative relationship between retrieval latency and JOLs (see Figure 4.3). Theory-based judgments are evident in demonstrations that JOLs frequently conform to general beliefs about memory. For example, participants' JOLs appear to reflect the belief that generated items are more memorable than items that have been read (Begg et al., 1991; Castel et al., 2013), that diagrams are easier to remember than text (Serra & Dunlosky, 2010), and that important information is highly memorable (Soderstrom & McCabe, 2011). There is some evidence that experience may sometimes usurp belief (theory) even when a strong belief is in place. For example, Koriat et al. (2004) had participants study paired associates and asked them to make JOLs anticipating either a 5-minute, 1-day, or 1-week retention interval. Although individuals hold a strong belief that the length of a retention interval is negatively related to memory performance, JOLs did not reflect this pattern when the retention interval was manipulated between subjects. Rather, participants provided similar JOLs regardless of the length of the retention interval (but see Tauber & Rhodes, 2012a). Kornell has similarly demonstrated that participants will often ignore the benefits of additional study-test opportunities when predicting future learning (e.g., Kornell & Bjork, 2009; Kornell et al., 2011). At present, the circumstances under which JOLs are driven by experience, belief, or some combination of those influences remain poorly understood. Part of this reflects an often-untested assumption about whether JOLs are driven by belief or experience. As an example, consider the influence of relatedness on JOLs. This influence may indicate a general belief that related items (e.g., Nurse-Doctor) are easier to remember than unrelated items, or it could reflect enhanced processing fluency emanating from two items that have a strong association. Mueller, Tauber, and Dunlosky (2013) attempted to address this issue by (a) evaluating beliefs about relatedness and (b) competitively assessing the influence of relatedness and a measure of processing fluency. To assess belief, Mueller et al. (2013) used Castel's (2008) prestudy JOL procedure, whereby participants make a JOL prior to seeing an item, rendering it impossible for the fluency of a particular item to drive judgment. Accordingly, participants made prestudy JOLs, cued only with an indication of whether the item was related or unrelated. Under those circumstances, participants' prestudy JOLs were far greater for related compared with unrelated items, suggesting a strong belief that related items are easier to remember than unrelated items. In another experiment, Mueller et al. had participants make lexical decision judgments (to assess ease of processing) on pairs of items and also solicited JOLs. Overall, JOLs were associated with relatedness (r = 0.66) and with lexical decision times (r = –0.21). However, controlling for response latencies had no impact on the magnitude of the correlation between JOLs and relatedness. Thus, these data indicate that the influence of relatedness on JOLs is predominantly a function of belief rather than experience. Additional investigations of this nature will be necessary to more firmly understand how belief and experience contribute to JOLs.
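The logic of "controlling for response latencies" can be illustrated with the standard first-order partial correlation. The two JOL correlations below are the values reported above, whereas the correlation between relatedness and lexical decision time is a hypothetical placeholder, since it is not reported here.

```python
# Partial correlation of JOLs with relatedness, controlling for lexical
# decision time (RT). r_jol_rel and r_jol_rt are reported above; r_rel_rt
# is an assumed value for illustration only.
from math import sqrt

r_jol_rel = 0.66   # JOLs with relatedness
r_jol_rt = -0.21   # JOLs with lexical decision time
r_rel_rt = -0.20   # relatedness with RT (hypothetical)

partial = (r_jol_rel - r_jol_rt * r_rel_rt) / sqrt(
    (1 - r_jol_rt ** 2) * (1 - r_rel_rt ** 2)
)
print(round(partial, 2))  # ~0.65: essentially unchanged from 0.66
```

Under assumed values in this range, the JOL-relatedness association is left largely intact, mirroring the reported conclusion that belief, not fluency, carries the effect.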

Pending Issues and Future Directions

Although work on monitoring and JOLs is well beyond its nascent stage as a domain of research in metacognition, a number of issues remain to be resolved and necessitate more attention. For example, as noted in the previous section, the roles of belief and experience in driving JOLs have yet to be fully elucidated. In addition, research on JOLs has largely been confined to verbal materials, particularly paired associates. In order to advance understanding of memory predictions, a larger variety of materials, tests, and encoding conditions remains to be explored (see e.g., Rhodes, Sitzman, & Rowland, 2013; Simon & Bjork, 2001; Tauber, Dunlosky, Rawson, Wahlheim, & Jacoby, 2013, for efforts in this vein). These are just two of a number of issues that merit attention and will drive future research in metacognition.

Methods of Soliciting Predictions

The modal method of soliciting predictions of future memory performance is to simply have participants rate, on a Likert or percentage scale, the likelihood that some bit of information will be remembered in the future. Recent work has sought to determine whether altering the method of judging future remembering influences the accuracy of predictions (e.g., Finn, 2008; Hanczakowski et al., 2013; McCabe & Soderstrom, 2011; McGillivray & Castel, 2011; Tauber & Rhodes, 2012a). For example, Finn (2008) compared typical JOLs with instructions to indicate the likelihood that information would be forgotten on a future test (termed judgments of forgetting). Judging forgetting did not affect relative accuracy but did influence absolute accuracy, making participants' judgments generally more conservative compared with JOLs (see also Koriat et al., 2004). McCabe and Soderstrom (2011) reported that asking participants to predict whether an item would be accompanied by contextual details at test improved relative accuracy compared with JOLs. Further, Tauber and Rhodes (2012a) asked participants to indicate how long (in minutes) information would be remembered. In contrast to work indicating that participants may deem information equally likely to be remembered in 1 week or 5 minutes (Koriat et al., 2004), Tauber and Rhodes' (2012a) participants provided modest predictions, indicating that information would be remembered, on average, for approximately 15 minutes. Although intriguing, these data do not engender a clear interpretation of how the scale used affects judgment. For example, the method of soliciting judgment may alter response distributions without changing the underlying information that informs judgment (cf. Nelson, Leonesio, Landwehr, & Narens, 1986). Indeed, it is not apparent whether these alternative judgments induce participants to consider different kinds of information about memory or reflect a variation in judgment based on the same information (Dunlosky & Tauber, 2013). For example, Serra and England (2012) demonstrated that soliciting judgments of forgetting may lead participants to use a different anchor (i.e., initial value for considering judgments) than JOLs but does not represent a qualitative change in the type of information considered. Similarly, asking for estimates of the amount of time information will be remembered has no impact on relative accuracy compared with standard JOLs, suggesting that discrimination is not affected by that judgment (Tauber & Rhodes, 2012a). However, there is some indication that altering the method of soliciting judgment may change the information under consideration, potentially improving predictions. As noted, McCabe and Soderstrom (2011) reported that asking about future memory states improved relative accuracy. Further, Soderstrom and Rhodes (2014) have shown that such predictions may diminish a potent metacognitive illusion. Future work should continue to refine methods of soliciting JOLs in light of whether or not the framing of a judgment changes the information under consideration.

Influence of Judgment on Memory

In their original report on JOLs, Arbuckle and Cuddy (1969) included a condition that did not provide JOLs during the learning phase in order to determine whether JOLs interfered with learning. Their results indicated precisely the opposite: Participants required to make JOLs exhibited significantly better recall than participants who did not make JOLs. Thus, the act of making a JOL altered the phenomenon under investigation, enhancing memory for the information being judged (Spellman & Bjork, 1992). Surprisingly, relatively few studies report a condition that assesses the potential impact of JOLs on memory performance (e.g., Benjamin et al., 1998; Dougherty et al., 2005; Kelemen & Weaver, 1997; King et al., 1980; Sommer, Heinz, Leuthold, Matt, & Schweinberger, 1995; Sundqvist, Todorov, Kubik, & Jönsson, 2012; Tauber & Rhodes, 2012a). King et al. (1980) varied whether participants received a prediction trial or an additional study trial. They found that both conditions resulted in similar levels of memory performance, leading them to conclude that "performing the prediction task was comparable to having an additional study trial" (King et al., p. 336). Likewise, Sommer et al. (1995) observed superior face recognition for a condition that made JOLs compared with a condition that did not make JOLs. Others have reported no effect (Benjamin et al., 1998; Tauber & Rhodes, 2012b) or mixed effects of JOLs on memory. For example, Kelemen and Weaver (1997) observed no effect of immediate JOLs on retention but did report a memorial benefit following delayed JOLs (see also Sundqvist et al., 2012). The potential for reactivity (i.e., soliciting JOLs alters memory performance) suggests an agenda for future research: (a) to include appropriate control conditions to assess the impact of prediction on memory and (b) to provide a viable explanation of such reactivity. One possible explanation is that JOLs simply increase the amount of exposure time for a particular item and have the same benefits that would accrue with additional study time (King et al., 1980). However, JOLs might also encourage additional encoding operations that provide a unique advantage even when controlling for study time. For example, Dougherty and colleagues (2005) used the PRAM procedure (Nelson et al., 2004), asking participants to make a recall attempt prior to judging either their confidence that a studied item was recalled or the likelihood of future recall (a JOL). JOLs led to higher levels of recall on a later test than a no-judgment condition but also compared with a condition in which participants made retrospective confidence judgments (RCJs).

In explaining this finding, Dougherty et al. (2005) suggested that

… the requirement to make JOLs actually alters how well participants are learning the to-be-recalled items at study. Perhaps participants who make JOLs implement a more effective study strategy than participants making RCJs because the JOL task forces them to focus on future retrieval. Interestingly, the finding that there was no effect of making RCJs on any of the recall measures suggests that there is something special about making JOLs that improves learning. (p. 1110)

Future research will profit by examining whether there is indeed something special for learning that results from making JOLs.

Multiple Cues to Judgment

Much of the JOL literature focuses on specific cues in isolation in order to determine whether a given cue influences judgment. Although this approach reveals important determinants of JOLs, it falls short of capturing the kinds of learning environments individuals actually experience, such as that of the example student highlighted at the beginning of the chapter. In those settings, predictions of learning are likely driven by multiple cues. Accordingly, it is essential to achieve an understanding of how multiple cues function collectively to inform judgment (Ariel & Dunlosky, 2011; Tauber & Rhodes, 2012b). As an example of the importance of examining multiple sources of information in forming JOLs, consider advising a student that predictions of learning will be most accurate when they are delayed. This advice is generally sound but ignores the fact that the benefits of delaying judgment hinge on the type of cue used to solicit judgment. Namely, delayed JOLs accompanied by the cue and the target are little more accurate than immediate JOLs (Dunlosky & Nelson, 1992). It is only when JOLs are solicited in a manner that permits unfettered retrieval from long-term memory (i.e., cue-only JOLs) that the advantage of delayed judgment is evident (see Figure 4.2). Other work does not suggest such interactions but points to independence among cues. For example, although participants believe that important information is more likely to be remembered, manipulating the value of information does not alter the impact of relatedness on JOLs (Soderstrom & McCabe, 2011). Further, relatedness appears to drive judgment and dwarf the impact of other cues, such as the size of words presented for study (Rhodes & Castel, 2008).

Regardless of the nature of the relationship, identifying how multiple sources of information interact or sum to contribute to judgment will be necessary for a full account of the bases and nature of JOLs.
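One way to ask whether cues sum or interact is to regress JOLs on several cues at once. The sketch below does this on invented data in which relatedness dominates font size, loosely mirroring the pattern just described; none of the numbers come from the studies cited.

```python
# Hypothetical sketch: estimating the joint contribution of two cues to
# JOLs via multiple regression. All data are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 200
relatedness = rng.integers(0, 2, n)   # 1 = related pair, 0 = unrelated
font_size = rng.integers(0, 2, n)     # 1 = large font, 0 = small font
# Generate JOLs in which relatedness carries most of the weight.
jols = 40 + 30 * relatedness + 3 * font_size + rng.normal(0, 10, n)

X = np.column_stack([np.ones(n), relatedness, font_size])
coef, *_ = np.linalg.lstsq(X, jols, rcond=None)
print(dict(zip(["intercept", "relatedness", "font_size"], coef.round(1))))
# The recovered weights show relatedness dwarfing font size; adding an
# interaction term (relatedness * font_size) would test for dependence.
```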

A Few Final Thoughts

The first four and a half decades of research on JOLs have yielded a number of insights into how individuals judge their future learning and have provided a rich array of data to inform theory. Advances over the past 20 years have been rapid and characterized by an increased understanding of the information that drives memory predictions but also a greater appreciation of the complexities of assessing the future state of memory. Although work on JOLs has generated myriad findings, several core themes can be extracted.

1. Individuals are unlikely to have privileged access to the contents and strength of memory. Instead, predictions rely on a host of cues that vary in their diagnosticity.

2. JOLs are the product of a combination of elements from the online experience of learning and general beliefs about memory. The proportional contribution of each to judgment remains an open question.

3. Orienting individuals toward judgments that rely on retrieval after a delay is one of the soundest means to ensure that a learner can discriminate between what has been learned and what has not been learned.

These themes will continue to be refined and future work should yield additional insights into how individuals predict learning.

Notes

1. This chapter focuses exclusively on item-by-item judgments of learning. Another approach is to solicit aggregate, or global, predictions of memory performance, such as the total number of items that will be remembered (e.g., Connor, Dunlosky, & Hertzog, 1997).

2. For example, in their meta-analysis of JOLs elicited following a delay compared with immediately after study, Rhodes and Tauber (2011a) observed that 103 of 112 (92%) effect sizes were for JOLs solicited on a percentage scale.

3. It is important to note that the distinction between item-based influences and those present in the conditions of encoding or testing is necessarily imprecise and serves only as a rough organizing framework for considering the wider literature on JOLs. For example, the manner in which an item is processed will be an interactive function of the learner and the information to be learned (e.g., consider how individuals with or without knowledge of English would process the word allergy). In addition, the conditions of encoding and/or testing may alter how an item is perceived, as when the same items are studied repeatedly. Thus, although these factors are not mutually exclusive, they are treated separately to facilitate exposition.

4. Kelley and Jacoby (1996) used somewhat different terms, distinguishing between subjective and analytic bases for judgment.

References

Arbuckle, T. Y., & Cuddy, L. L. (1969). Discrimination of item strength at time of presentation. Journal of Experimental Psychology, 81, 126–131. Ariel, R., & Dunlosky, J. (2011). The sensitivity of judgments-oflearning resolution to past test performance, new learning, and forgetting. Memory & Cognition, 39, 171–184. Begg, I., Duft, S., Lalonde, P., Melnick, R., & Sanvito, J. (1989). Memory predictions are based on ease of processing. Journal of Memory and Language, 28, 610–632. Begg, I., Vinski, E., Frankovich, L., & Holgate, B. (1991). Generating makes words memorable, but so does effective reading. Memory & Cognition, 19, 487–497. Benjamin, A. S., Bjork, R. A., & Schwartz, B. L. (1998). The mismeasure of memory: When retrieval fluency is misleading as a metamnemonic index. Journal of Experimental Psychology: General, 127, 55–68. Benjamin, A. S., & Diaz, M. (2008). Measure of relative metamnemonic accuracy. In J. Dunlosky & R. A. Bjork (Eds.), Handbook of memory and metamemory (pp. 73–94). New York, NY: Psychology Press. Bieman-Copland, S., & Charness, N. (1994). Memory knowledge and memory monitoring in adulthood. Psychology and Aging, 9, 287–302. Busey, T. A., Tunnicliff, J., Loftus, G. R., & Loftus, E. F. (2000). Accounts of the confidence-accuracy relation in recognition memory. Psychonomic Bulletin & Review, 7, 26–48. Carroll, M., Nelson, T. O., & Kirwan, A. (1997). Tradeoff of semantic relatedness and degree of overlearning: Differential effects on metamemory and long-term retention. Acta Psychologica, 95, 239–253. Castel, A. D. (2008). Metacognition and learning about primacy and recency effects in free recall:  The utilization of intrinsic and extrinsic cues when making judgments of learning. Memory & Cognition, 36, 429–437. Castel, A. D., McCabe, D. P., & Roediger, H. L. III. (2007). Illusions of competence and overestimation of associative memory for identical items:  Evidence from judgments of learning. Psychonomic Bulletin & Review, 14, 107–111. Castel, A. D., Rhodes, M. G., & Friedman, M. C. (2013). Predicting memory benefits in the production effect: The use and misuse of self-generated distinctive cues when making judgments of learning. Memory & Cognition, 41, 28–35. Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132, 354–380. Cohen, R. L., Sandler, S. P., & Keglevich, L. (1991). The failure of memory monitoring in a free recall task. Canadian Journal of Psychology, 45, 523–538. Connor, L. T., Dunlosky, J., & Hertzog, C. (1997). Age-related differences in absolute but not relative metamemory accuracy. Psychology and Aging, 12, 50–71. Craik, F. I.  M., & Lockhart, R. S. (1972). Levels of processing:  A  framework for memory research. Journal of Verbal Learning and Verbal Behavior, 11, 671–684.


Dunlosky, J., & Matvey, G. (2001). Empirical analysis of the intrinsic-extrinsic distinction of judgments of learning (JOLs):  Effects of relatedness and serial position on JOLs. Journal of Experimental Psychology:  Learning, Memory, and Cognition, 27, 1180–1191. Dunlosky, J., & Nelson, T. O. (1992). Importance of the kind of cue for judgments of learning (JOL) and the delayed-JOL effect. Memory & Cognition, 20, 374–380. Dunlosky, J., & Nelson, T. O. (1994). Does the sensitivity of judgments of learning (JOLs) to the effects of various study activities depend on when the JOLs occur? Journal of Memory and Language, 33, 545–565. Dunlosky, J., & Nelson, T. O. (1997). Similarity between the cue for judgments of learning (JOL) and the cue for test is not the primary determinant of JOL accuracy. Journal of Memory and Language, 36, 34–49. Dunlosky, J., & Tauber, S. K. (2013). Understanding people’s metacognitive judgments: An isomechanism framework and its implications for applied and theoretical research. In T. Perfect & S. Lindsay (Eds.) Handbook of Applied Memory. Thousand Oaks, CA: Sage. Dougherty, M. R., Scheck, P., Nelson, T. O., & Narens, L. (2005). Using the past to predict the future. Memory & Cognition, 33, 1096–1115. Eakin, D. K., & Hertzog, C. (2012). Immediate judgments of learning are insensitive to implicit interference effects at retrieval. Memory and Cognition, 40, 8–18. England, B. D., & Serra, M. J. (2012). The contributions of anchoring and past-test performance to the underconfidence-with-practice effect. Psychonomic Bulletin & Review, 19, 715–722. Finn, B. (2008). Framing effects on metacognitive monitoring and control. Memory & Cognition, 36, 813–821. Finn, B., & Metcalfe, J. (2007). The role of memory for past test in the underconfidence with practice effect. Journal of Experimental Psychology:  Learning, Memory, and Cognition, 33, 238–244. Finn, B., & Metcalfe, J. (2008). Judgments of learning are influenced by memory for past test. Journal of Memory and Language, 58, 19–34. Flavell, J. H. (1979). Metacognition and cognitive monitoring:  A  new area of cognitive–developmental inquiry. American Psychologist, 34, 906–911. Flavell, J. H., Friedrichs, A. G., & Hoyt, J. D. (1970). Developmental changes in memorization processes. Cognitive Psychology, 1, 324–340. Gonzalez, R., & Nelson, T. O. (1996). Measuring ordinal association in measures that contain tied scores. Psychological Bulletin, 119, 159–165. Groninger, L. D. (1976). Predicting recognition during storage:  The capacity of the memory system to evaluate itself. Bulletin of the Psychonomic Society, 7, 425–428. Groninger, L. D. (1979). Predicting recall: The” feeling-that-Iwill-know” phenomenon. The American Journal of Psychology, 92, 45–58. Hanczakowski, M., Zawadzka, K., Pasek, T., & Higham, P. A. (2013). Calibration of metacognitive judgments:  Insights from the underconfidence-with-practice effect. Journal of Memory and Language, 69, 429–444. Hart, J. T. (1965). Memory and the feeling-of-knowing experience. Journal of Educational Psychology, 56, 208-216. Hertzog, C., & Dunlosky, J. (2000). Updating knowledge about encoding strategies:  A  componential analysis of learning

about strategy effectiveness from task experience. Psychology and Aging, 15, 462–474. Hertzog, C., Dunlosky, J., Robinson, A. E., & Kidder, D. P. (2003). Encoding fluency is a cue used for judgments about learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 22–34. Jang, Y., & Nelson, T. O. (2005). How many dimensions underlie judgments of learning and recall? Evidence from state-trace methodology. Journal of Experimental Psychology:  General, 134, 308–326. Kelemen, W. L., & Weaver, C. (1997). Enhanced memory at delays:  Why do judgments of learning improve over time? Journal of Experimental Psychology:  Learning, Memory, and Cognition, 23, 1394–1409. Kelley, C. M., & Jacoby, L. L. (1996). Adult egocentrism:  Subjective experience versus analytic bases for judgment. Journal of Memory & Language, 35, 157–175. King, J. F., Zechmeister, E. B., & Shaughnessy, J. J. (1980). Judgments of knowing:  The influence of retrieval practice. The American Journal of Psychology, 93, 329–343. Koriat, A. (1997). Monitoring one’s own knowledge during study: A cue-utilization approach to judgments of learning. Journal of Experimental Psychology: General, 126, 349–370. Koriat, A., & Bjork, R. A. (2005). Illusions of competence in monitoring one’s knowledge during study. Journal of Experimental Psychology: Learning, Memory, & Cognition, 31, 187–194. Koriat, A., & Bjork, R. A. (2006a). Illusions of competence during study can be remedied by manipulations that enhance learners’ sensitivity to retrieval conditions at test. Memory & Cognition, 34, 959–972. Koriat, A., & Bjork, R. A. (2006b). Mending metacognitive illusions:  A  comparison of mnemonic-based and theory-based procedures. Journal of Experimental Psychology:  Learning, Memory, and Cognition, 32, 1133–1145. Koriat, A., Bjork, R. A., Sheffer, L., & Bar, S. K. (2004). Predicting one’s own forgetting: The role of experience-based and theory-based processes. Journal of Experimental Psychology: General, 133, 643–656. Koriat, A., & Ma’ayan, H. (2005). The effects of encoding fluency and retrieval fluency on judgments of learning. Journal of Memory and Language, 52, 478–492. Koriat, A., Sheffer, L., & Ma’ayan, H. (2002). Comparing objective and subjective learning curves:  Judgments of learning exhibit increased underconfidence with practice. Journal of Experimental Psychology: General, 131, 147-162. Koriat, A., Ma’ayan, H., Sheffer, L., & Bjork, R. A. (2006). Exploring a mnemonic debiasing account of the underconfidence-with-practice effect. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32, 595–608. Kornell, N., & Bjork, R. A. (2009). A stability bias in human memory: Overestimating remembering and underestimating learning. Journal of Experimental Psychology: General, 138, 449–468. Kornell, N., & Rhodes, M. G. (2013). Feedback reduces the metacognitive benefit of tests. Journal of Experimental Psychology: Applied, 19, 1–13. Kornell, N., Rhodes, M. G., Castel, A. D., & Tauber, S. K. (2011). The ease-of-processing heuristic and stability bias:  Dissociating memory, memory beliefs, and memory judgment. Psychological Science, 22, 787–794. Levin, J. R., Yussen, S. R., Pressley, M., & de Rose, T. M. (1977). Developmental changes in assessing recall and recognition memory capacity. Developmental Psychology, 13, 608–615.

Leonesio, R. J., & Nelson, T. O. (1990). Do different metamemory judgments tap the same underlying aspects of memory? Journal of Experimental Psychology:  Learning, Memory, and Cognition, 16, 464–470. Logan, J. M., Castel, A. D., Haber, S., & Viehman, E. J. (2012). Metacognition and the spacing effect:  The role of repetition, feedback, and instruction on judgments of learning for massed and spaced rehearsal. Metacognition and Learning, 7, 175–195. Lovelace, E. A. (1984). Metamemory:  Monitoring future recallability during study. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10, 756–766. Maki, R. H. (1999). The roles of competition, target accessibility, and cue familiarity in metamemory for word pairs. Journal of Experimental Psychology:  Learning, Memory, and Cognition, 25, 1011–1023. Masson, M. E. J., & Rotello, C. M. (2009). Sources of bias in the Goodman-Kruskal gamma coefficient measure of association: Implications for studies of metacognitive processes. Journal of Experimental Psychology:  Learning, Memory, and Cognition, 35, 509–527. Matvey, G., Dunlosky, J., Shaw, R. J., Parks, C., & Hertzog, C. (2002). Age-related equivalence and deficit in knowledge updating of cue effectiveness. Psychology and Aging, 17, 589–597. Mazzoni, G., Cornoldi, C., & Marchitelli, G. (1990). Do memorability ratings affect study time allocation? Memory & Cognition, 18, 196–204. McCabe, D. P., & Soderstrom, N. C. (2011). Recollection-based metamemory judgments are more accurate than those based on confidence:  Judgments of Remembering and Knowing (JORKs). Journal of Experimental Psychology:  General, 140, 605–621. McGillivray, S., & Castel, A. D. (2011). Betting on memory leads to metacognitive improvement by younger and older adults. Psychology and Aging, 26, 137–142. Metcalfe, J., & Finn, B. (2008). Evidence that judgments of learning are causally related to study choice. Psychonomic Bulletin & Review, 15, 174–179. Mueller, M. L., Tauber, S. K., & Dunlosky, J. (2013). Contributions of beliefs and processing fluency to the effect of relatedness on judgments of learning. Psychonomic Bulletin & Review, 20, 378–384. Mueller, M. L., Dunlosky, J., Tauber, S. K., & Rhodes, M. G. (2014). The font-size effect on judgments of learning: Does it exemplify fluency effects or reflect people’s beliefs about memory? Journal of Memory and Language, 70, 1–12. Nelson, T. O. (1984). A comparison of current measures of the accuracy of feeling-of-knowing predictions. Psychological Bulletin, 84, 93–116. Nelson, T. O. (1996). Consciousness and metacognition. American Psychologist, 51, 102–116. Nelson, T. O., & Dunlosky, J. (1991). When people’s judgments of learning (JOLs) are extremely accurate at predicting subsequent recall: The “delayed-JOL effect.” Psychological Science, 2, 267–270. Nelson, T. O., Leonesio, R. J., Landwehr, R. S., & Narens, L. (1986). A comparison of three predictors of an individual’s memory performance:  The individual’s feeling of knowing versus the normative feeling of knowing versus base-rate item difficulty. Journal of Experimental Psychology: Learning, Memory, and Cognition, 12, 279–287 Nelson, T. O., & Narens, L. (1990). Metamemory: A theoretical framework and some new findings. In G. H. Bower (Ed.),


The psychology of learning and motivation (pp. 125–173). New York, NY: Academic Press. Nelson, T. O., Narens, L., & Dunlosky, J. (2004). A Revised methodology for research on metamemory:  Pre-judgment recall and monitoring (PRAM). Psychological Methods, 9, 53–69. Nomi, J. S., Rhodes, M. G., & Cleary, A. M. (2013). Effect of emotional facial expressions on predictions of identity recognition. Cognition and Emotion, 27, 141–149 Rhodes, M. G., & Castel, A. D. (2008). Memory predictions are influenced by perceptual information: Evidence for metacognitive illusions. Journal of Experimental Psychology:  General, 137, 615–625. Rhodes, M. G., & Castel, A. D. (2009). Metacognitive illusions for auditory information: Effects on monitoring and control. Psychonomic Bulletin & Review, 16, 550–554. Rhodes, M. G., Sitzman, D. M., & Rowland, C. A. (2013). Monitoring and control of learning own-race and other-race faces. Applied Cognitive Psychology, 27, 553–563. Rhodes, M. G., & Tauber, S. K. (2011a). The influence of delaying judgments of learning (JOLs) on metacognitive accuracy: A meta-analytic review. Psychological Bulletin, 137, 131–148. Rhodes, M. G., & Tauber, S. K. (2011b) Eliminating the delayed JOL effect: The influence of the veracity of retrieved information on metacognitive accuracy. Memory, 19, 853–870. Roediger, H. L., III, & Butler, A. C. (2011). The critical role of retrieval practice in long-term retention. Trends in Cognitive Science, 15, 20–27. Roediger, H. L., & Karpicke, J. D. (2006). Test-enhanced learning:  Taking memory tests improves long-term retention. Psychological Science, 17, 249–255. Schmidt, R. A., & Bjork, R. A. (1992). New conceptualizations of practice: Common principles in three paradigms suggest new concepts for training. Psychological Science 3, 207–217. Schwartz, B. L., & Efklides, A. (2012). Metamemory and memory efficiency:  Implications for student learning. Journal of Applied Research in Memory and Cognition, 1, 145–151. Serra, M. J., & Dunlosky, J. (2010). Metacomprehension judgments reflect the belief that diagrams improve learning from text. Memory, 18, 698–711. Serra, M. J., & England, B. D. (2012). Magnitude and accuracy differences between judgments of remembering and forgetting. Quarterly Journal of Experimental Psychology, 65, 2231–2257. Shaw, R. J., & Craik, F. I. M. (1989). Age differences in predictions and performance on a cued recall task. Psychology and Aging, 4, 131–135. Simon, D. A., & Bjork, R. A. (2001). Metacognition in motor learning. Journal of Experimental Psychology:  Learning, Memory, and Cognition, 27, 907–912. Spellman, B. A., & Bjork, R. A. (1992). When predictions create reality:  Judgments of learning may alter what they are intended to assess. Psychological Science, 5, 315–316. Soderstrom, N. C., & McCabe, D. P. (2011). The interplay between value and relatedness as bases for metacognitive monitoring and control:  Evidence for agenda-based monitoring. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37, 1236–1242. Soderstrom, N. C., & Rhodes, M. G. (2014) Metacognitive illusions can be reduced by monitoring recollection during study. Journal of Cognitive Psychology, 26, 118–126.


Sommer, W., Heinz, A., Leuthold, H., Matt, J., & Schweinberger, S. R. (1995). Metamemory, distinctiveness, and event-related potentials in recognition memory for faces Memory & Cognition, 23, 1–11. Sundqvist, M. L., Todorov, I., Kubik, V., & Jönsson, F. U. (2012). Study for now, but judge for later:  Delayed judgments of learning promote long-term retention. Scandinavian Journal of Psychology, 53, 450–454. Sungkhasettee, V. W., Friedman, M. C., & Castel, A. D. (2011). Memory and metamemory for inverted words:  Illusions of competency and desirable difficulties. Psychonomic Bulletin & Review, 18, 973–978. Tauber, S. K., & Dunlosky, J. (2012). Can older adults judge their learning of emotional information? Psychology and Aging, 27, 924–933. Tauber, S. K., Dunlosky, J., Rawson, K. A., Wahlheim, C. N., & Jacoby, L. L. (2013). Self-regulated learning of a natural category: Do people interleave or block exemplars during study? Psychonomic Bulletin & Review, 20, 356–363. Tauber, S. K., & Rhodes, M. G. (2010a). Are judgments of learning (JOLs) sensitive to the amount of material to-beremembered? Memory, 18, 351–362. Tauber, S. K., & Rhodes, M. G. (2010b). Metacognitive errors contribute to the difficulty in remembering proper names. Memory, 18, 522–532. Tauber, S. K., & Rhodes M. G. (2012a). Measuring memory monitoring with judgments of retention interval (JOR). Quarterly Journal of Experimental Psychology, 65, 1376–1396. Tauber, S. K., & Rhodes M. G. (2012b). Multiple bases for young and older adults’ judgments-of-learning (JOLs) in multitrial learning. Psychology and Aging, 27, 474–483. Thiede, K. W. (1996). The relative importance of anticipated test format and anticipated test difficulty on performance The Quarterly Journal of Experimental Psychology, 49A, 901–918. Thiede, K. W., & Dunlosky, J. (1994). Delaying students’ metacognitive monitoring improves their accuracy in predicting their recognition performance. Journal of Educational Psychology, 86, 290–302. Tullis, J. G., Finley, J. R., & Benjamin, A. S. (2013). Metacognition of the testing effect:  Guiding learners to predict the benefits of retrieval. Memory & Cognition, 41, 429–442. Undorf, M., & Erdfelder, E. (2011). Judgments of learning reflect encoding fluency:  Conclusive evidence for the ease-of-processing hypothesis. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37, 1264–1269. Vesonder, G. T., & Voss, J. F. (1985). On the ability to predict one’s own responses while learning. Journal of Memory and Language, 24, 363–376. Yue, C. L., Castel, A. D., & Bjork, R. A. (2013). When disfluency is—and is not—a desirable difficulty: The influence of typeface clarity on metacognitive judgments and memory. Memory & Cognition, 41, 229–241. Yussen, S. R., & Levy, V. M. (1975). Developmental changes in predicting one’s own span of short-term memory. Journal of Experimental Child Psychology, 19, 502–508. Zechmeister, E. B., & Shaughnessy, J. J. (1980). When you know that you know and when you think that you know but you don’t. Bulletin of the Psychonomic Society, 15, 41–44. Zimmerman, C. A., & Kelley, C. M. (2010). “I’ll remember this!” Effects of emotionality on memory predictions versus memory performance. Journal of Memory and Language, 62, 240–253.

CHAPTER 5

Introspecting on the Elusive: The Uncanny State of the Feeling of Knowing

Ayanna K. Thomas, Meeyeon Lee, and Gregory Hughes

Abstract

The state of knowing in the absence of knowledge is a peculiar metacognitive phenomenon that intuitively implies that we are able to introspect on memory processes of search, storage, and retrieval. The ability to make this assessment suggests that we may be able to use certain cues to assess the quality of knowledge that may be hidden from conscious view. The focus of this chapter is the uncanny metacognitive state of the feeling of knowing (FOK). We examine the theoretical questions that have motivated research into this phenomenon. These questions are viewed through a historical perspective, allowing for a more complete understanding of how research into this subjective state has evolved. This chapter concludes with a discussion of the present state of the field, examines neurocognitive mechanisms, reviews the questions that presently concern FOK researchers, and proposes an applied direction for future research.

Key Words: feeling of knowing, judgment accuracy, cue familiarity, accessibility, post-retrieval monitoring

“Suppose we try to recall a forgotten name. The state of our consciousness is peculiar. There is a gap therein; but not a mere gap. It is a gap that is intensely active. A sort of wraith of the name is in it, beckoning us to give it direction, making us at moments tingle with the sense of our closeness…. The rhythm of a lost word may be there without a sound to clothe it; or the evanescent sense of something which is the initial vowel or consonant may mock us fitfully without growing more distinct.” —James, 1899, pp. 243–244

This description by William James (1899) eloquently articulates a subjective state that we have all experienced, and it served as a road map for the body of research that would emerge to investigate the feeling of knowing (FOK). Inherent in James' statement is that we can sense the presence of some inaccessible target information. How we determine its presence has been a primary focus of research on the FOK. Critical to our understanding of this experience is the acknowledgment that FOKs are metacognitive judgments about the cognitive system. The foundation of the modern understanding of the FOK is the theoretical position that the metacognitive system contains two interrelated levels, which will be referred to as the "meta-level" and the "object-level." According to this model, depicted in Figure 5.1, information flows from the object-level to the meta-level. It is that information that contributes to metacognitive monitoring. Specific to FOK judgments, object information flows to the meta-level after information has been learned. That is, FOKs are judgments about whether a given currently non-recallable item is known and/or will be subsequently remembered. Experimentally, FOKs are typically measured in the context of the recall-judgment-recognition (RJR) paradigm, first developed by Hart (1965) to examine the accuracy of the FOK experience. Under the RJR protocol, participants attempted to recall either general information (Hart, 1965) or information learned in the laboratory (Hart, 1967).

[Diagram: a meta-level containing a model of the object-level; control flows from the meta-level to the object-level, and monitoring information flows from the object-level to the meta-level.]

Figure 5.1  The Interactive Relationship Between Monitoring and Control. Reprinted with permission from Metacognition: Knowing about Knowing (p. 11) by J. Metcalfe and A. Shimamura (1994). Cambridge, MA: Bradford Books.

A recall trial consisted of giving the participant a cue and prompting retrieval of a target (e.g., "Which planet is the largest in our solar system?" or "What nonsense syllable was paired with frog in the list you just finished studying?"). If target retrieval failed, the participant would be asked to make a yes-or-no FOK judgment. Thus, the FOK was developed to be a judgment of future recognition. Finally, participants were given an N-alternative forced-choice (N-AFC) recognition test to assess the validity of FOKs. Accuracy of FOKs was determined by comparing objective performance on the final recognition test with FOK predictions. The experience of knowing in the absence of accessed knowledge implies that the human cognitive system has some mechanism in place from which it can derive this assessment. Further, the ability to later retrieve information that had earlier accompanied strong FOK states would suggest that there is some predictive value to these assessments. That is, these states must be at least somewhat accurate. The following sections address the predictive value of FOKs, or FOK judgment accuracy, and the various factors that have been posited to influence FOK judgments.

Are Feelings of Knowing Accurate Predictors of Future Retrieval?

Within the context of the model depicted in Figure 5.1, metacognitive monitoring resulting in FOKs has the ability to direct actions taken at the object-level (Nelson & Narens, 1990). For example, a strong FOK may result in continued search of memory, whereas a weak FOK may result in termination of search. Research has demonstrated that FOKs do affect strategy selection (e.g., Reder & Ritter, 1992). As one example, Reder and Ritter (1992) demonstrated that participants were more likely to rely on a memory-based strategy if they had been exposed previously to similar information. In their game show paradigm, participants were presented with arithmetic problems for 850 milliseconds and were asked to choose a strategy to solve each problem. They were given two strategies as choices: retrieval or calculation. Strategy choice was incentivized. Participants were awarded 50 points when they selected the retrieval strategy, responded on time, and answered correctly. Five points were awarded when participants selected the calculation strategy and answered correctly. This incentive scheme was devised to facilitate a reliance on memory. Regardless of participants' choices, the correct answer always appeared for two seconds. Finally, throughout the experiment arithmetic problems were presented multiple times, and exposure to parts within given problems varied. Problems were repeated to influence the availability of the answers. Exposure to problem parts was varied to influence FOK. Figure 5.2 illustrates this clever paradigm. Toward the end of the experiment, novel problems that looked similar to studied problems were presented; that is, new problems that may have included parts of previously presented problems were presented for a strategy decision. The brief 850-millisecond presentation reduced the likelihood that participants could solve the problem, or even remember the solution of the problem, from a prior presentation. Thus, participants had to assess the best strategy based solely on the factors presented in the problem.
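To see why this payoff scheme facilitates a reliance on memory, a rough expected-value comparison helps; the assumed accuracy of the calculation strategy below is my own illustrative value, not a figure from Reder and Ritter (1992).

```python
# Expected points for each strategy choice, under the 50- vs. 5-point
# scheme described above. p_calc is an assumed accuracy for calculation.
p_calc = 0.95  # assume calculation, given ample time, is almost always right

def better_strategy(p_retrieve):
    """p_retrieve: chance of retrieving the answer within the deadline."""
    ev_retrieve = 50 * p_retrieve
    ev_calculate = 5 * p_calc
    return "retrieve" if ev_retrieve > ev_calculate else "calculate"

for p in (0.05, 0.10, 0.25, 0.75):
    print(f"p(retrieve correct) = {p:.2f} -> {better_strategy(p)}")
# Retrieval pays whenever the chance of retrieving in time exceeds roughly
# 10%, so even a modest feeling of knowing should tip choices to retrieval.
```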

Subject says “next” into microphone to start trial 500 ms delay Problem appears on Screen 17 ∗ 23 Subject must press R or C key in 850 ms

If Retrieve pressed

If Calculate pressed

Subject must say correct answer in ~1s to get 50 points

Subject must work out answer (in head or on paper) within ~18 s to get 5 points

On to next trial

Subject gets feedback on correctness, times and points

Forced 2 s study time Figure 5.2  The Game Show Paradigm: Procedure for a single trial (R = retrieve; C = calculate). Reprinted with permission from Reder and Ritter (1992).

in the problem. The best strategy would be the one that not only resulted in an accurate answer, but also resulted in the greatest points reward. Reder and Ritter (1992) found that the retrieve strategy choice was directly affected by previous exposure to problem parts, regardless of whether the entire problem had been seen before. Further, although rapid, these decisions were accurate in the sense that participants were able to assess which strategy was optimal. Participants selected the retrieve strategy when they knew that they could provide the answer within the short retrieve (1 second) deadline. Reder and Ritter (1992) interpreted this finding as support for the notion that people could control cognitive processes based on initial monitoring judgments. That is, participants could select an appropriate strategy based on familiarity with terms in the question. The game show paradigm convincingly demonstrated that participants could assess knowledge and select an appropriate strategy based on that assessment, in the absence of knowledge itself. Thus, within the context of this paradigm, the monitoring process affected control. Reder and Ritter (1992) also demonstrated that in this context monitoring assessments were relatively accurate. That is,
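The payoff scheme just described is simple enough to state as code. The sketch below is a hypothetical rendering of the scoring rule only; the actual procedure also enforced the 850-millisecond strategy deadline and separate answer deadlines for each strategy.

```python
def score_trial(strategy, on_time, correct):
    """Payoff scheme of the game show paradigm, as described above.

    Retrieval pays 50 points and calculation only 5, so relying on
    memory is rewarded whenever the answer can be produced in time.
    """
    if not (on_time and correct):
        return 0
    return 50 if strategy == "retrieve" else 5

print(score_trial("retrieve", on_time=True, correct=True))   # 50
print(score_trial("calculate", on_time=True, correct=True))  # 5
```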

With strong evidence that monitoring processes such as the FOK do affect control, a natural and important question is whether monitoring processes are accurate. If monitoring assessments are inaccurate, or biased, then the resulting control processes are also likely to be faulty. Not surprisingly, one of the primary questions that shaped FOK research was whether FOKs are predictive of later memory performance (Hart, 1965; Koriat, 1993). In the context of the RJR paradigm, FOK accuracy has been investigated for both semantic and episodic memory. As Figure 5.3 illustrates, the primary difference between the paradigms used to investigate semantic and episodic FOK accuracy is the point at which FOKs are collected. In an episodic task, participants are first exposed to new material during an initial encoding phase; the example in Figure 5.3 illustrates a paired-associates episodic task. After encoding, participants are asked to recall the target of a pair when prompted with the cue. Regardless of accuracy, participants are prompted to provide a FOK judgment. FOK judgment scales may vary; however, participants are typically required to predict later recognition performance on some scale. Figure 5.3 depicts a percentage scale. In the context of semantic memory tasks, participants are often given normed general-knowledge questions. The encoding phase is eliminated in such a paradigm because the assumption is that participants have already been exposed to the information being tested. FOK accuracy is examined by comparing FOK predictions to recognition performance.

Figure 5.3  Illustration of episodic and semantic FOK paradigms.

Several techniques have been developed to measure the relationship between FOK judgments and later memory performance; for a more comprehensive review, we refer the reader to Higham, Zawadzka, and Hanczakowski (this volume). Generally, research has focused on comparing FOK judgments with later recognition performance. This relationship is understood as the relative accuracy of FOK judgments, defined as the extent to which individuals' judgments discriminate between items that will and will not be remembered. Relative accuracy is typically determined by correlating objective memory performance with judgments.
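In this literature, relative accuracy is classically indexed by the Goodman-Kruskal gamma correlation between judgments and recognition outcomes (see Nelson, 1984). A minimal sketch, assuming per-item FOK ratings and binary recognition scores (the function name and example data are ours):

```python
from itertools import combinations

def goodman_kruskal_gamma(judgments, outcomes):
    """Relative accuracy as the Goodman-Kruskal gamma correlation.

    judgments: per-item FOK ratings (e.g., 0-100).
    outcomes:  per-item recognition accuracy (1 = correct, 0 = incorrect).
    Gamma = (concordant - discordant) / (concordant + discordant),
    computed over all item pairs, ignoring ties.
    """
    concordant = discordant = 0
    for (j1, o1), (j2, o2) in combinations(zip(judgments, outcomes), 2):
        if j1 == j2 or o1 == o2:
            continue  # tied pairs carry no ordinal information
        if (j1 > j2) == (o1 > o2):
            concordant += 1
        else:
            discordant += 1
    if concordant + discordant == 0:
        return float("nan")
    return (concordant - discordant) / (concordant + discordant)

# One participant: higher FOKs mostly go to later-recognized items
foks = [90, 20, 60, 10, 80]
recognized = [1, 0, 1, 0, 0]
print(goodman_kruskal_gamma(foks, recognized))  # ~0.67
```

A gamma near +1 indicates that higher FOKs reliably pick out the items that will later be recognized; a gamma near zero indicates judgments no better than chance.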

Greater accuracy has been demonstrated for semantic memory predictions than for episodic memory predictions. For example, research has shown that individuals provided more accurate FOK judgments for proper names than for common low-frequency nouns (Izaute, Chambers, & Larochelle, 2002). Izaute et al. (2002) suggested that semantic information associated with proper names was more specific to the target, and thus more diagnostic, than information associated with common noun targets. The accuracy of FOK judgments is also known to vary with the type of error made in recall (Krinsky & Nelson, 1985). For instance, individuals provided higher FOK judgments for items that they answered incorrectly (commission errors) than for items they did not answer at all (omission errors), although recognition performance has been shown to be the same for the two item types. Thus, commission errors result in less accurate FOK judgments relative to omission errors (Krinsky & Nelson, 1985). Other factors, such as the quality of initial encoding (Thomas, Bulevich, & Dubois, 2012) and memory for the learning strategy used (Hertzog, Fulton, Sinclair, & Dunlosky, 2014), have also been shown to affect FOK judgment accuracy. Specifically, episodic FOKs tend to be more accurate when participants engage in deep as opposed to shallow encoding and when participants can remember some aspects of the initial learning episode. These results suggest that the original memory representation may be related to FOK judgment accuracy. The relationship between memory representation and FOK accuracy has influenced the theoretical models developed to describe the underlying mechanism that governs FOKs. (In the next section of the chapter we discuss this relationship in detail.) Finally, research has demonstrated age-related deficits in episodic FOK judgment accuracy (Perrotin, Belleville, & Isingrini, 2007; Souchay, Moulin, Clarys, Taconnat, & Isingrini, 2007) but age invariance in semantic FOK judgment accuracy (Allen-Burge & Storandt, 2000; Souchay et al., 2007), mirroring the age-related semantic-episodic distinction found in memory tasks. These age-related differences suggest that the cues relevant to episodic FOK accuracy may be qualitatively different from those relevant to semantic FOK accuracy. One of the defining characteristics of episodic memory is the inclusion of contextual information (Tulving, 1972). These contextual elements may play a crucial role in episodic FOK accuracy: older adults may demonstrate age-related decline in episodic FOK accuracy because of reduced access to contextual cues. For more on the topic of age-related changes in metamemory, we refer the reader to Castel, Middlebrooks, and McGillivray (this volume).

To summarize, FOK judgments influence subsequent control processes. Individuals select specific cognitive operations based on their subjective assessments of knowing or not knowing. Unfortunately, FOKs are not perfect predictors of memory. FOK accuracy seems to depend on access to diagnostic information about initial encoding and on cues that become accessible upon memory search. A more comprehensive discussion of the relationship between cue diagnosticity and FOK judgment accuracy is presented in the final section of this chapter.

How Are Feeling-of-Knowing Judgments Made?

The Internal Monitor

Early investigations into FOKs were influenced by the underlying assumption that monitoring knowledge was a process independent of memory search. Implied in this assumption is that the process that produces FOKs is guided by an internal monitor that has privileged access to information stored in memory. The internal monitor makes assessments regarding the strength of the underlying memory representation. This theoretical approach, proposed by Hart (1967), effectively guided FOK research for nearly 30 years. Importantly, the model successfully accounted for accurate and inaccurate FOK states and adequately accounted for findings of FOK sensitivity to encoding manipulations (Lupker, Harbluk, & Patrick, 1991; Nelson, Gerler, & Narens, 1984; Nelson, Leonesio, Shimamura, Landwehr, & Narens, 1982; Schacter, 1983; Thomas et al., 2012), as well as findings of a relationship between partial target information accessibility and FOKs (Blake, 1973; Eysenck, 1979; Koriat & Lieblich, 1974; Thomas, Bulevich, & Dubois, 2011). However, this framework, and the memory mechanisms that influenced its development, have largely fallen into disfavor. According to Hart (1965), the FOK is instrumental in determining the length and focus of a memory search: if one has a strong FOK, one may continue to search for a target. Using the RJR methodology, Hart demonstrated that FOK judgments were relatively good indicators of what participants did and did not know. That is, participants gave higher FOK judgments to items that they later recognized than to items that they failed to recognize. This consistent pattern led Hart to develop the memory-monitoring framework. The model assumes that although a target may be inaccessible, the Monitor can assess its later retrievability by evaluating the strength of the representation. If memory strength falls above a particular threshold, the Monitor can effectively predict later recognition; if memory strength falls below this threshold, FOK predictions are likely to be inaccurate. Hart argued that this internal monitor is an important mechanism because it helps determine the nature and focus of a memory search.
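Hart's monitor thus reduces to a threshold rule on trace strength. A minimal sketch, in which the threshold value is an arbitrary illustration rather than a parameter from Hart's papers:

```python
def hart_monitor(trace_strength, threshold=0.5):
    """Positive FOK whenever the (inaccessible) trace exceeds threshold.

    The monitor is assumed to read trace strength directly -- precisely
    the 'privileged access' that the later literature came to reject.
    """
    return trace_strength > threshold

print(hart_monitor(0.72))  # True: predicts later recognition
print(hart_monitor(0.31))  # False: predicts recognition failure
```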

Hart's original methodology and theoretical interpretation of the FOK process were extremely influential, and researchers were still providing evidence for an internal monitor mechanism as late as 1987. For example, Yaniv and Meyer (1987) found that FOK predictions were associated with faster decisions on a lexical decision task (LDT). In this study, they presented participants with a series of four rare-word definitions, for each of which participants tried to retrieve the correct answer, a rare word. Rare-word definitions were used to promote conditions of retrieval failure. If they failed to retrieve a word, participants then assessed FOK and tip-of-the-tongue (TOT) states for the unrecallable target word. This initial phase is similar to the first phase of a typical semantic FOK experiment, as illustrated in Figure 5.3. After a set of rare-word definitions, participants were given an LDT in which the rare words were embedded within a list of new words and nonwords. In this task, participants had to classify various strings of letters as English words or nonwords. For each letter string, a positive or negative decision was made, and the time to make the decision was measured.

Importantly, the LDT test stimuli included target words that would have qualified as appropriate responses to the definitions presented in the first phase of the experiment. Yaniv and Meyer were interested in the relationship between speed of responding on the LDT and the earlier metacognitive assessments. They hypothesized that the LDT would tap residual semantic and episodic memory traces for target words. Thus, participants would make faster lexical decisions for targets that left behind a strong residual trace than for those that left weaker traces. In the context of Hart's memory-monitoring framework, targets that had strong traces would also be those associated with strong FOKs. Yaniv and Meyer (1987) found that rare-word definitions that accompanied high FOK judgments were indeed responded to more quickly on the LDT than definitions given low FOK assessments. They argued that the memory search engendered by the presentation of rare-word definitions activated associated targets. Although this activation may not have been sufficient for retrieval, it was sufficient to facilitate performance on the LDT. Further, the internal monitor was able to independently monitor this activation, which is why participants demonstrated a relationship between FOK judgments and LDT performance. The memory search that resulted from the presentation of rare-word definitions activated stored but inaccessible memory traces. Consistent with Hart's (1965) framework, internal activation was independently assessed, and words that were more strongly activated received higher FOKs. The increased activation also resulted in facilitation on indirect (LDT) and direct (recognition) tests of memory. Yaniv and Meyer's (1987) interpretation supports an independent monitoring mechanism. Further, their interpretation suggests that activation in semantic networks is persistent and relatively long lasting. However, this interpretation implied that primary assumptions of the spreading-activation model (cf. Collins & Loftus, 1975) needed significant revision. In that model, semantic memory is proposed to be organized into nodes that represent information units, and those nodes are connected via pathways into a large network. One major assumption of the model is that activation decays rapidly with passing time and intervening activity. Yaniv and Meyer challenged this assumption: they found facilitation on the LDT (indicative of activation) even after relatively long delays of up to 30 minutes.
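To see why the 30-minute result was troubling for the decay assumption, consider a toy exponential-decay model. The half-life used here is our illustrative assumption; classic accounts assume decay on the order of seconds, so even this value is generous.

```python
import math

def residual_activation(initial, minutes, half_life_minutes=0.5):
    """Exponential decay of node activation with a 30-second half-life."""
    return initial * math.exp(-math.log(2) * minutes / half_life_minutes)

print(residual_activation(1.0, minutes=0.1))  # seconds later: ~0.87, still substantial
print(residual_activation(1.0, minutes=30))   # ~1e-18: cannot explain LDT facilitation
```

Under any rapid-decay parameterization, residual activation after half an hour is effectively zero, which is why the long-delay facilitation demanded another explanation.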


Rather than advocate a revision of the assumptions of the spreading-activation framework, Connor, Balota, and Neely (1992) proposed an alternative explanation. They suggested that FOK judgments were influenced by individuals' explicit assessment of their knowledge of a given area and by familiarity with various elements of the initial query. In a study designed to test this alternative, Connor et al. (1992) demonstrated similar facilitatory effects on a lexical decision task even when the accessibility estimates preceded the LDT by a week. That is, the FOK assessment phase, in which participants were presented with rare-word definitions, occurred a full week before the subsequent LDT. With such a delay between the activation phase and criterial testing, it seems highly unlikely that facilitation on the LDT could be the result of subthreshold activation remaining in the semantic network. As further evidence, Connor et al. (1992) demonstrated little change in LDT performance when the task preceded the rare-word definition phase; that is, participants demonstrated similar levels of facilitation on the LDT in the absence of any activation from rare-word definition presentation. These findings suggest that activation of information in semantic networks was an unlikely explanation for FOK assessments and LDT performance. This evidence, coupled with methodological changes in the way FOKs were measured, shifted the field away from a specialized internal-monitor module and toward models in which FOKs are based on inferential computations and the monitoring of superthreshold information.

The Inferential Basis for Feelings of Knowing

The internal monitor proposed by Hart (1967) accounted for FOK prediction accuracy, explained the utility of the FOK process, was congruent with strength theories of memory, and made intuitive sense. That is, FOKs are relatively accurate because referents to a memory trace can be independently accessed. Just as the names of files stored on a computer are located in a directory separate from the files themselves, so are referents to memory traces (see Koriat, 1993, for a similar discussion). Thus, the internal monitor may first access this directory to determine whether to continue a search for the file. This view holds that FOK judgments are the product of a specialized and encapsulated module (Fodor, 1983). Similar to strength-based theories of memory (e.g., Norman & Wickelgren, 1969; Wickelgren & Norman, 1966), it holds that FOKs are based on the strength of the signal: if the signal is above threshold, the FOK judgment is positive; if below threshold, the FOK judgment is negative. Though simple and elegant, the trace-access account has not been well supported. Take, for example, the findings of Connor et al. (1992), which suggest that FOK judgments may be based on some other inferential process, such as familiarity with the topic. That is, a pattern of LDT facilitation for rare words was obtained even under conditions in which the rare-word definition followed the definition's referent by one week, so there was no opportunity for the referent's node to have been activated by prior exposure to the definition. As alternatives to the internal monitor, Nelson, Gerler, and Narens (1984) suggested several possible theoretical mechanisms that could underlie the FOK. Importantly, they suggested that the FOK does not monitor the nonrecalled target; rather, other information in memory is monitored and serves as the basis for the FOK. Formally, Nelson and Narens (1990) presented the "no-magic hypothesis" for how FOK judgments should be conceptualized. According to this hypothesis, FOKs monitor superthreshold information about remembered attributes, including incorrectly remembered superthreshold information. As evidence for this hypothesis, Nelson and Narens demonstrated that an individual's FOK was related more to his or her claimed frequency of previous recall than to the actual frequency of recall. That is, FOK judgments were mediated by a person's beliefs about learning as opposed to actual learning. According to the no-magic hypothesis, and subsequent views discussed below, there is no single determinant of the FOK. Rather, FOKs are computed from a variety of cues.

Cue Familiarity

Several studies have demonstrated a strong positive relationship between domain knowledge and FOKs, even in the absence of accurate test performance (Costermans, Lories, & Ansay, 1992; Glenberg, Sanocki, Epstein, & Morris, 1987; Maki & Serra, 1992). These results suggested that familiarity with information contained in the question, or familiarity with the cue, directly influenced FOK judgment magnitude. Further, Reder (1988) found that priming the cues spuriously increased FOK judgments but had no effect on test performance. Reder and Ritter (1992) demonstrated that people made initial evaluations of questions before making attempts to retrieve an answer. These initial evaluations seemed geared toward establishing a strategy for answering. Reder and Ritter demonstrated that participants could quickly determine whether they could retrieve an answer to an arithmetic problem or had to calculate to produce an answer. Participants made this decision (retrieve or calculate) in less than one second, suggesting that they were basing the decision on something other than access to the target, as this time frame was shorter than that necessary for retrieval (see Reder & Ritter, 1992). Importantly, participants' strategy choice was based on familiarity with components of the question, as opposed to a separate module that monitored the presence of the target. Providing convergent evidence, Jameson, Narens, Goldfarb, and Nelson (1990) were able to affect memory performance without changing FOKs by priming answers to general knowledge questions. Metcalfe, Schwartz, and Joaquim (1993) systematically varied familiarity with the cue and found that cue familiarity affected FOKs independently of target retrievability. They argued that the FOK process was fast, automatic, and consumed few cognitive resources. They further suggested that FOKs are typically based on surface features of the question and are often not influenced by the retrieval process, which they characterized as slow, cognitively demanding, and not automatic. They characterized the influence of cue familiarity as a heuristic that approximates the presence of an inaccessible target indirectly rather than by directly measuring the strength of the target representation. This framework accounted for both accurate and inaccurate FOK states by arguing that the cue familiarity heuristic results in accurate FOK judgments as long as the familiarity of the cue is correlated with target retrievability.

There is little doubt that familiarity with aspects of a probe question or cue will influence people's subjective assessment of knowing. Researchers have consistently demonstrated that the cue familiarity heuristic operates rapidly and often without the influence of partial target access. The influence of cue familiarity on FOKs is nonanalytic and inferential in nature. This hypothetical inferential mechanism, while extremely successful in accounting for many findings in the FOK literature, fails to account for results demonstrating a relationship between partial target information access and FOK judgment magnitude.

Accessibility

The relationship between partial target access and FOKs suggests that targets do influence FOKs. Whereas Hart's framework posited a separate mechanism that monitored the presence of the target in memory stores, Koriat (1993) proposed a computational, inferential mechanism that accounts for the influence of targets on FOKs. A large body of research has consistently demonstrated that aspects of the target do influence FOKs, and Koriat (1993) persuasively argued that retrieval monitoring influences FOK judgment magnitude. Several important assumptions differentiate Koriat's framework from previous models that included a role for target retrieval in the FOK process. As opposed to an independent-module mechanism, or mechanisms that directly monitor the actual correct target (i.e., target-retrievability mechanisms), Koriat argued that FOKs are based on partial information associated with a target that becomes accessible during retrieval attempts. When we attempt to retrieve some piece of information, we activate a host of clues and attributes that may help guide our search or locate a target (e.g., A. S. Brown, 1991). When a search fails and a target is not located, the byproducts of the search remain accessible and may influence one's FOK; it is this debris that influences the judgment. According to Koriat, the speed and quantity of information accessed during a retrieval attempt can be used to compute FOKs. He further suggested that in the early stages of FOK assessment, judgments are likely to be made solely on the speed of access and the quantity of information retrieved, that is, on the accessibility of partial information.
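The accessibility account is, at heart, a computation over the byproducts of retrieval, so it can be caricatured in a few lines of code. In the sketch below, the function name and weights are our illustrative assumptions, not parameters from Koriat (1993); the essential property is that the judgment rises with the amount and speed of whatever comes to mind, with no check on its accuracy.

```python
def accessibility_fok(n_fragments, mean_latency_s, w_amount=15.0, w_speed=20.0):
    """Toy FOK (0-100) from amount and ease of access alone.

    n_fragments counts ALL retrieved fragments, correct or not --
    the heuristic has no access to their accuracy.
    """
    speed = 1.0 / max(mean_latency_s, 0.1)   # faster access -> larger contribution
    return min(100.0, w_amount * n_fragments + w_speed * speed)

print(accessibility_fok(n_fragments=4, mean_latency_s=0.5))  # much info, fast: 100.0
print(accessibility_fok(n_fragments=1, mean_latency_s=3.0))  # little info, slow: ~21.7
```

Because incorrect fragments feed the computation just as correct ones do, this caricature predicts exactly the pattern Koriat reported in the experiment described next.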

In one experiment, Koriat (1993) had participants attempt to recall nonword letter strings after study. Participants were encouraged to recall as many letters as they could retrieve. After recall, participants provided FOK judgments regarding the likelihood of recognizing the correct string among lures. As Figure 5.4 illustrates, FOK judgment magnitude increased with the number of both correct and incorrect letters recalled. In both panels of the figure, mean FOK is plotted against retrieved partial information: the left panel plots mean FOKs as a function of correct partial information, with wrong partial information as a parameter, and the right panel plots mean FOKs as a function of incorrect partial information, with correct partial information as a parameter. The partial information consisted of letters recalled during the FOK phase; when participants could not recall the nonword letter string, they were encouraged to recall letters that may have been present in the string. As the left panel illustrates, FOK judgments increased with the number of correct letters recalled. Importantly, as the right panel illustrates, FOK judgments also increased with the number of incorrect letters recalled. Further, when the number of letters recalled was held constant, Koriat found that FOK judgments increased with the ease with which information came to mind, as measured by retrieval latency.

Figure 5.4  Mean FOK judgments plotted (left) as a function of correct partial information recalled (PI-C) and (right) as a function of incorrect partial information recalled (PI-W). From Koriat (1993), Experiment 1.

Current Directions: Feeling of Knowing Is Multiply Determined

Combined Contributions

A historical perspective on the FOK literature allows us to realize that FOKs are not governed by one underlying mechanism. Rather, a host of previously described mechanisms seem to work in concert to produce the single subjective state.


The FOK seems to be computed across time: as more time passes, more information becomes accessible and influences FOK judgments. Take, for example, the two inferential mechanisms already discussed in some detail. The cue-familiarity and accessibility accounts postulate the operation of different inferential computations. The cue-familiarity heuristic posits that characteristics of the question may influence FOK judgments prior to attempted retrieval (Metcalfe et al., 1993; Schwartz & Metcalfe, 1992). In contrast, the accessibility heuristic postulates that FOKs are influenced by the amount of partial information, and the ease with which that information comes to mind, during the act of attempted retrieval (Koriat, 1993, 1995). Early research examining these two mechanisms treated them as competing alternatives (e.g., Maki, 1999); however, more recent studies suggest that the two mechanisms probably operate in tandem (Koriat & Levy-Sadot, 2001). In one study, Koriat and Levy-Sadot (2001) manipulated cue familiarity and accessibility orthogonally. Importantly, they found that both cue familiarity and the accessibility of information following retrieval affected the magnitude of FOKs. FOKs varied with the familiarity of referents in the cue question, and FOKs also varied with the size of the category from which a target was drawn. Finally, the cue-familiarity and accessibility heuristics interacted: the effects of accessibility were much weaker for low-familiarity questions than for high-familiarity questions. Koriat and Levy-Sadot also demonstrated that cue familiarity continued to exert an influence after attempted retrieval, indicating that cue familiarity is not confined to rapid, preliminary, preretrieval judgments. Research continues to support the combined contributions of cue familiarity and target accessibility. As one example, in a recent study, participants were explicitly asked whether they used familiarity with the cue and/or had access to information about the target when making their FOK judgments. In this case, the cue was a famous face and the target was the associated name. Participants indicated that both familiarity with the face and access to partial information associated with the target influenced their FOK judgments (Hosey, Peynircioğlu, & Rabinovitz, 2009).
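The interactive pattern reported by Koriat and Levy-Sadot (2001) can likewise be caricatured by letting cue familiarity gate how much accessibility contributes to the judgment. The gating rule and weights below are our illustrative assumptions, not the authors' model:

```python
def combined_fok(cue_familiarity, accessibility):
    """Toy combined judgment (all quantities scaled 0-1).

    Low cue familiarity discourages a retrieval attempt, so accessibility
    contributes little; with high familiarity, accessibility dominates --
    the interaction Koriat and Levy-Sadot (2001) reported.
    """
    attempt_weight = cue_familiarity           # familiarity gates the retrieval attempt
    return cue_familiarity * 0.3 + attempt_weight * accessibility * 0.7

for fam in (0.2, 0.9):
    for acc in (0.1, 0.8):
        print(f"familiarity={fam}, accessibility={acc} -> FOK={combined_fok(fam, acc):.2f}")
```

Running the sketch shows that raising accessibility moves the judgment only slightly when familiarity is low (0.07 to 0.17) but substantially when familiarity is high (0.33 to 0.77), mirroring the reported interaction.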

Neuroscientific Support for Combined Contributions

Neurological correlates of FOKs also support the conclusion that these two underlying processes make combined contributions. The rapid advancement in the spatial and temporal resolution of brain imaging over the last two decades has afforded researchers the ability to elucidate the neurological mechanisms of the FOK with more precision. Research using neuropsychological and neuroscientific methods has demonstrated that FOKs are related to the frontal lobes; more specifically, fronto-temporal and fronto-parietal networks have recently been implicated. Both networks seem to operate in accord with the cue-familiarity and accessibility mechanisms. Kikyo and Miyashita (2004) proposed that the FOK relies on top-down signals from prefrontal cortical regions in monitoring the output and activity of the temporal lobe. In an episodic face-name paradigm, the researchers found that activity in both lateral prefrontal and lateral temporal regions modulated linearly with FOK magnitude; that is, the higher the FOK, the more activity in these regions. In addition, brain regions involved in the visual processing of faces (used as the cue in this task) were activated, suggesting that processing of the cue also contributes to FOK production. Expanding on this frontal-temporal network, Schnyer et al. (2004) proposed an FOK activation network that includes the ventromedial prefrontal cortex (VMPFC), bilateral inferior frontal cortex (IFC), the lateral temporal lobes, and the hippocampus. All of these regions have been consistently identified as neural substrates of the FOK (Chua, Schacter, Rand-Giovannetti, & Sperling, 2006; Elman, Klostermann, Marian, Verstaen, & Shimamura, 2012; Kikyo & Miyashita, 2004; Kikyo, Ohki, & Miyashita, 2002; Maril, Simons, Mitchell, Schwartz, & Schacter, 2003; Maril, Simons, Weaver, & Schacter, 2005; Modirrousta & Fellows, 2008; Reggev, Zuckerman, & Maril, 2011). Consistent with the findings of Kikyo and Miyashita (2004), Schnyer et al. found that lateral prefrontal and lateral temporal lobe activity modulated with the magnitude of FOK. Consistent with the accessibility hypothesis, the VMPFC exhibited linear modulation with both the magnitude of FOK judgments and temporal lobe activity in an episodic sentence-completion task (Schnyer, Nicholls, & Verfaellie, 2005). Alternatively, the IFC seems to be involved in processing the cue, which in turn triggers objects represented in the lateral temporal lobe and stored in the hippocampus. Like the VMPFC, the right inferior frontal cortex (RIFC) has been implicated in several imaging studies of the FOK and appears to be involved in the monitoring and assessment of temporal lobe output (Aron, Robbins, & Poldrack, 2004).

Thus, activation in the RIFC is consistent with the involvement of cue-familiarity mechanisms (Reggev et al., 2011; Schnyer et al., 2005). Unlike the VMPFC and lateral temporal cortex, the RIFC tracked with neither FOK magnitude nor temporal lobe activity (Schnyer et al., 2005). This lack of a graded response suggests that the RIFC is not sensitive to the quantity of temporal lobe output; rather, it appears that the RIFC assesses the appropriateness of retrieved memory contents. Both the RIFC and VMPFC networks might interact in a manner consistent with the combined accessibility/cue-familiarity model (Koriat & Levy-Sadot, 2001). If an object is deemed sufficiently familiar, additional retrieval processes seek to locate the target in memory. The RIFC is related to the processing of cues and subsequently activates representations in the temporal lobes: if familiarity of the cue is high, RIFC activation is high, and it continues to drive activation through the frontal-temporal network. Although the bulk of the preceding analysis concerns frontal-temporal activity, frontal-parietal networks also play a role in the FOK. The monitoring role played by the RIFC, for example, may partially rely on parietal lobe activity. Research suggests that the RIFC and parietal cortex regions form a functional network in episodic retrieval (Shimamura, 2011). In addition, the parietal cortex plays a role in working memory by holding a representation of attended stimuli online (Jonides et al., 1998). The parietal cortex may therefore aid the RIFC in cue-related processing by holding cue-related information in working memory. Parietal cortical regions are thus likely to aid the RIFC in assessing both the cue (via their role in working memory) and retrieved memory contents (via their role in episodic memory retrieval).

Concluding Remarks

The FOK is determined by multiple factors (Koriat & Levy-Sadot, 2001; Nelson, 1984). Nonanalytic processes are driven primarily by an appraisal of the cue as well as by retrieval attempts to locate the target, and both are crucial in determining FOK judgments. Highly familiar or highly recognizable cues result in a strong FOK. In these cases, FOKs may be quickly determined and may result in a judgment even before retrieval is attempted (Reder & Ritter, 1992). The time course of the FOK may be such that preliminary inspection of the cue encourages or discourages attempted retrieval of the target (Koriat, 1993, 1995, 1998). For example, a cue may seem so familiar that an individual assesses the presence of the target solely on the basis of that familiarity; conversely, a cue may seem completely unfamiliar and discourage attempted retrieval. In the event that a retrieval attempt is made, information associated with the target may become accessible. This information can include potential target fragments, contextual information related to episodic encoding, semantic associates, or some combination of these. Importantly, in the context of the RJR methodology, retrieval is always attempted before FOK judgments are made, likely resulting in combined contributions of cue-familiarity and accessibility processes. Perspectives regarding the FOK and memory representations have evolved in parallel. The internal monitor (Hart, 1965, 1967) was influenced by unitary, strength-based models of memory (Wickelgren & Norman, 1966). Presently, researchers investigating the FOK take the perspective that FOKs are multiply determined, based on a complex series of inferential computational processes. The idea that FOKs are multiply determined parallels present perspectives on memory representation. As opposed to reproductive, verbatim retrieval, modern thinking is that memories are constructions based on inferential processes influenced by such things as fluency of processing, ease of retrieval, and noncriterial recollection of specific features derived from perception or thought (Jacoby & Kelley, 1992a, 1992b; Jacoby, Kelley, & Dywan, 1989; Jacoby, Toth, & Yonelinas, 1993; Jacoby, Toth, Yonelinas, & Debner, 1994; Kelley & Jacoby, 1993, 1998; Yonelinas & Jacoby, 1996). Perceptual information (e.g., color, size), spatial information, temporal details, semantic information (e.g., category membership, associated items, connotation), emotional information, and cognitive operations are but a few of the features that combine to form a complex representation (e.g., Mitchell & Johnson, 2009). The value of, relationships among, and time course of these various features are the subject of the current direction of the field. The final section of this chapter examines this modern movement in FOK research. We review research suggesting that noncriterial correct recollection of information influences both FOK judgment magnitude and prediction accuracy. This new research is presented in the context of the accessibility heuristic, which proposed that FOK judgments are based on the total amount of partial information that is retrieved.

Within this context, we evaluate when the conscious experience of the FOK is valuable (i.e., predictive of future retrieval) and when it is irrelevant. We firmly support the model in which monitoring affects control (Nelson & Narens, 1990); thus, accurate monitoring has a greater chance of resulting in effective regulation of memory processes. We conclude the chapter with a discussion of important applications of FOK research. Recent research suggests that the quality, or accuracy, of retrieved partial information influences FOK judgment magnitude and prediction accuracy (Brewer, Marsh, Clark-Foos, & Meeks, 2010; Hertzog, Dunlosky, & Sinclair, 2010; Hertzog et al., 2014; Thomas et al., 2011, 2012). Further, some kinds of partial information seem to be more important for monitoring accuracy than others. Memory and metamemory research supports the conclusion that attempted retrieval can bring to mind information that is either connected with or peripheral to the target (Toth & Parks, 2006; Yonelinas & Jacoby, 1996). As mentioned previously, early research on the FOK suggested that when participants retrieved partial information such as target fragments, those fragments were more likely to be correct than incorrect (R. Brown & McNeill, 1966; Kelley & Jacoby, 1993), and this retrieval process resulted in relatively accurate FOK predictions. Whereas individuals may be more likely to retrieve accurate fragments of a target, thus producing accurate FOK predictions, myriad noncriterial information can also influence FOK judgment magnitude and prediction accuracy. We suggest that the accuracy of certain kinds of partial information may impact FOK judgment accuracy. On the surface, this proposal seems at odds with the tenets of the accessibility heuristic; however, we do not suggest that the quality of partial information is explicitly monitored. Rather, we suggest that the nature of the material, and the factors at play during encoding and at the time of the FOK judgment, will influence the accessibility of correct partial information. When more correct partial information is accessible, participants are able to make more accurate predictions. For example, in Koriat's (1993) first two experiments, in which the partial information consisted of target fragments (i.e., individual letters from a five-consonant string), participants tended to retrieve more correct than incorrect partial information, and FOKs were predictive of future recognition. Alternately, in Experiment 3 of that same article, when the partial information was a dimensional attribute of the target (i.e., affective connotation), FOK predictive accuracy was no better than guessing.

In addition, using a methodology similar to Koriat (1993, Experiment 3), Thomas et al. (2011) found that participants provided higher FOK judgments when they retrieved the correct valence of the target item than when they retrieved an incorrect target valence. Improved accuracy of retrieved valence information in that study resulted in more accurate FOK predictions. Importantly, Koriat (1993) used pronounceable nonword strings paired with target words that carried a positive or negative connotation, whereas Thomas et al. (2011) used paired associates in which the cue was neutral and the target carried a positive or negative connotation. This small but significant change in experimental materials demonstrated an important point regarding FOKs: when studied material incorporated semantic meaning, or had the potential to be meaningfully integrated, the quality of accessible partial information became critical to the accuracy of FOK prediction. The semantic relationship between even unrelated paired associates will invariably be stronger than the relationship within a nonword-word pairing. An integrated unit may result in greater activation of related information (Balota & Lorch, 1986, 2004; Collins & Loftus, 1975; McNamara & Healy, 1988; Sass et al., 2012), resulting in access to more useful (i.e., more accurate) partial and contextual information associated with the target. The accessibility of correct partial information will thus influence the predictive value of the FOK. Support for this hypothesis was garnered in a study by Thomas et al. (2012) in which the accessibility of specific classes of partial information was manipulated. The authors investigated how the inherent properties of the studied material, and factors operating at encoding, influence access to accurate partial information. Access to perceptual and semantic features was orthogonally manipulated through encoding procedures. Thomas et al. found that the quality of conceptual features (i.e., category membership) influenced the accuracy of FOK prediction more strongly than did perceptual features. In addition, when access to conceptual attributes was minimized, perceptual attribute accuracy remained irrelevant to FOK predictive value. Similarly, research has demonstrated that manipulations at encoding that direct attention to specific noncriterial attributes influence FOK judgment magnitude (Brewer et al., 2010). That is, when participants were told to pay attention to specific, task-irrelevant contextual elements of studied pictures, those noncriterial elements played a greater role in FOKs than did elements that were also presented but not instructionally emphasized.

FOKs were higher when these irrelevant elements were correct than when they were incorrect. These recent studies provide strong support for the intimate relationship between memory and metamemory. Although a variety of information may influence FOKs, features that convey meaning (i.e., category membership, semantic relationships) are especially instrumental in FOK predictive accuracy. When information is devoid of meaning, we may be able to encode and store that information, but our ability to accurately assess the FOK is limited. As one example, imagine a native English speaker who attempts to learn Japanese. The learner is charged with the task of associating meaning with initially meaningless figures (kanji). During these early stages of learning, the to-be-remembered information is meaningless: instead of conceptual attributes, the learner may have access only to perceptual features, which seem to be less effective for FOK prediction accuracy. Eventually, the learner will master the written language by effectively associating meaning with those kanji. At this point, the learner may be able to assess FOKs with a relatively high level of accuracy. Access to diagnostic cues is thus determined by the nature of the material (e.g., Hertzog et al., 2014; Thomas et al., 2012) and the constraints in place at the time of encoding (Brewer et al., 2010; Hertzog et al., 2014). We strongly believe that a greater understanding of the processes underlying accurate FOK experiences is required so that researchers can better characterize the relationship between FOK judgments and the resulting control processes. This understanding should allow us to capitalize on the relationship between monitoring and control to improve overall memory ability. We propose that many aspects of memory can be thought of as skills: decisions at encoding, decisions regarding knowledge and access, and decisions at retrieval all affect objective memory outcomes, and all can be improved. In the context of the FOK, a broad understanding of the cues that produce accurate FOK predictions can be put to use. These cues can be consciously and analytically employed to assess knowledge and guide memory search. In this sense, the output from nonanalytic, inferentially based mechanisms can be recruited for more conscious, analytic evaluations of the information that affects FOKs.

Although the accessibility heuristic may not involve explicit analysis, by encouraging participants to consciously evaluate the nature of their assessment, the FOK process can become explicit. Individuals can be informed as to which cues may be crucial for prediction accuracy and which cues may inappropriately influence judgment magnitude. In this sense, participants can employ theory-based judgments (e.g., Koriat & Bjork, 2006) in the context of FOK predictions. While the FOK may be derived from the automatic output of inferential mechanisms, it may be possible to encourage individuals to consider the nature of these influences in order to improve the efficacy of their conscious subjective experience.

Authors’ Note

The authors would like to thank John B. Bulevich for helpful discussions, rigorous reviews, and overall support with regard to the preparation of this chapter.

References

Allen-Burge, R., & Storandt, M. (2000). Age equivalence in feeling-of-knowing experiences. The Journals of Gerontology Series B: Psychological Sciences and Social Sciences, 55, P214–P223.
Aron, A. R., Robbins, T. W., & Poldrack, R. A. (2004). Inhibition and the right inferior frontal cortex. Trends in Cognitive Sciences, 8, 170–177.
Balota, D. A., & Lorch, R. F. (1986). Depth of automatic spreading activation: Mediated priming effects in pronunciation but not in lexical decision. Journal of Experimental Psychology: Learning, Memory, and Cognition, 12(3), 336–345.
Balota, D. A., & Lorch, R. F., Jr. (2004). Depth of automatic spreading activation: Mediated priming effects in pronunciation but not in lexical decision. New York, NY: Psychology Press.
Blake, M. (1973). Prediction of recognition when recall fails: Exploring the feeling-of-knowing phenomenon. Journal of Verbal Learning & Verbal Behavior, 12, 311–319.
Brewer, G. A., Marsh, R. L., Clark-Foos, A., & Meeks, J. T. (2010). Noncriterial recollection influences metacognitive monitoring and control processes. The Quarterly Journal of Experimental Psychology, 63(10), 1936–1942.
Brown, A. S. (1991). A review of the tip-of-the-tongue experience. Psychological Bulletin, 109, 204–223.
Brown, R., & McNeill, D. (1966). The "tip of the tongue" phenomenon. Journal of Verbal Learning & Verbal Behavior, 5, 325–337.
Chua, E. F., Schacter, D. L., Rand-Giovannetti, E., & Sperling, R. A. (2006). Understanding metamemory: Neural correlates of the cognitive process and subjective level of confidence in recognition memory. NeuroImage, 29, 1150–1160.
Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82, 407–428. doi: http://dx.doi.org/10.1037/0033-295X.82.6.407
Connor, L. T., Balota, D. A., & Neely, J. H. (1992). On the relation between feeling of knowing and lexical decision: Persistent subthreshold activation or topic familiarity? Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(3), 544–554.

Costermans, J., Lories, G., & Ansay, C. (1992). Confidence level and feeling of knowing in question answering: The weight of inferential processes. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(1), 142–150.
Elman, J. A., Klostermann, E. C., Marian, D. E., Verstaen, A., & Shimamura, A. P. (2012). Neural correlates of metacognitive monitoring during episodic and semantic retrieval. Cognitive, Affective & Behavioral Neuroscience, 12, 599–609.
Eysenck, M. W. (1979). The feeling of knowing a word's meaning. British Journal of Psychology, 70, 243–251.
Fodor, J. A. (1983). The modularity of mind. Cambridge, MA: MIT Press.
Glenberg, A. M., Sanocki, T., Epstein, W., & Morris, C. (1987). Enhancing calibration of comprehension. Journal of Experimental Psychology: General, 116(2), 119–136.
Hart, J. T. (1965). Memory and the feeling-of-knowing experience. Journal of Educational Psychology, 56, 208–216.
Hart, J. T. (1967). Memory and the memory-monitoring process. Journal of Verbal Learning & Verbal Behavior, 6, 685–691.
Hertzog, C., Dunlosky, J., & Sinclair, S. M. (2010). Episodic feeling-of-knowing resolution derives from the quality of original encoding. Memory & Cognition, 38, 771–784.
Hertzog, C., Fulton, E. K., Sinclair, S. M., & Dunlosky, J. (2014). Recalled aspects of original encoding strategies influence episodic feelings of knowing. Memory & Cognition, 42, 126–140.
Hosey, L. A., Peynircioğlu, Z. F., & Rabinovitz, B. E. (2009). Feeling of knowing for names in response to faces. Acta Psychologica, 130, 214–224.
Izaute, M., Chambers, P., & Larochelle, S. (2002). Feeling-of-knowing for proper names. Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale, 56, 263–272.
Jacoby, L. L., & Kelley, C. M. (1992a). A process-dissociation framework for investigating unconscious influences: Freudian slips, projective tests, subliminal perception, and signal detection theory. Current Directions in Psychological Science, 1, 174–179.
Jacoby, L. L., & Kelley, C. M. (1992b). Unconscious influences of memory: Dissociations and automaticity. In A. D. Milner & M. D. Rugg (Eds.), The neuropsychology of consciousness (pp. 201–233). San Diego, CA: Academic Press.
Jacoby, L. L., Kelley, C. M., & Dywan, J. (1989). Memory attributions. Hillsdale, NJ: Lawrence Erlbaum Associates.
Jacoby, L. L., Toth, J. P., & Yonelinas, A. P. (1993). Separating conscious and unconscious influences of memory: Measuring recollection. Journal of Experimental Psychology: General, 122(2), 139–154.
Jacoby, L. L., Toth, J. P., Yonelinas, A. P., & Debner, J. A. (1994). The relationship between conscious and unconscious influences: Independence or redundancy? Journal of Experimental Psychology: General, 123(2), 216–219.
James, W. (1899). The stream of consciousness. In Talks to teachers on psychology—and to students on some of life's ideals (pp. 15–21). New York, NY: Metropolitan Books/Henry Holt and Company.
Jameson, K. A., Narens, L., Goldfarb, K., & Nelson, T. O. (1990). The influence of near-threshold priming on metamemory and recall. Acta Psychologica, 73, 55–68.
Jonides, J., Schumacher, E. H., Smith, E. E., Koeppe, R. A., Awh, E., Reuter-Lorenz, P. A., … Willis, C. R. (1998). The role of parietal cortex in verbal working memory. The Journal of Neuroscience, 18(13), 5026–5034.

Kelley, C. M., & Jacoby, L. L. (1993). The construction of subjective experience: Memory attributions. Malden, England: Blackwell Publishing.
Kelley, C. M., & Jacoby, L. L. (1998). Subjective reports and process dissociation: Fluency, knowing, and feeling. Acta Psychologica, 98, 127–140.
Kikyo, H., & Miyashita, Y. (2004). Temporal lobe activations of "feeling-of-knowing" induced by face–name associations. NeuroImage, 23, 1348–1357.
Kikyo, H., Ohki, K., & Miyashita, Y. (2002). Neural correlates for feeling-of-knowing: An fMRI parametric analysis. Neuron, 36, 177–186.
Koriat, A. (1993). How do we know that we know? The accessibility model of the feeling of knowing. Psychological Review, 100(4), 609–639.
Koriat, A. (1995). Dissociating knowing and the feeling of knowing: Further evidence for the accessibility model. Journal of Experimental Psychology: General, 124(3), 311–333.
Koriat, A. (1998). Metamemory: The feeling of knowing and its vagaries. In J. G. Adair & F. I. M. Craik (Eds.), Advances in psychological science (Vol. 2, pp. 461–469). Hove, England: Psychological Press.
Koriat, A., & Bjork, R. A. (2006). Mending metacognitive illusions: A comparison of mnemonic-based and theory-based procedures. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32(5), 1133–1145.
Koriat, A., & Levy-Sadot, R. (2001). The combined contributions of the cue-familiarity and accessibility heuristics to feelings of knowing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27(1), 34–53.
Koriat, A., & Lieblich, I. (1974). What does a person in a "TOT" state know that a person in a "don't know" state doesn't know? Memory & Cognition, 2(4), 647–655.
Krinsky, R., & Nelson, T. O. (1985). The feeling of knowing for different types of retrieval failure. Acta Psychologica, 58, 141–158.
Lupker, S. J., Harbluk, J. L., & Patrick, A. S. (1991). Memory for things forgotten. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17(5), 897–907.
Maki, R. H. (1999). The roles of competition, target accessibility, and cue familiarity in metamemory for word pairs. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25(4), 1011–1023.
Maki, R. H., & Serra, M. (1992). The basis of test predictions for text material. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(1), 116–126.
Maril, A., Simons, J. S., Mitchell, J. P., Schwartz, B. L., & Schacter, D. L. (2003). Feeling-of-knowing in episodic memory: An event-related fMRI study. NeuroImage, 18, 827–836.
Maril, A., Simons, J. S., Weaver, J. J., & Schacter, D. L. (2005). Graded recall success: An event-related fMRI comparison of tip of the tongue and feeling of knowing. NeuroImage, 24, 1130–1138.
McNamara, T. P., & Healy, A. F. (1988). Semantic, phonological, and mediated priming in reading and lexical decisions. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14(3), 398–409.
Metcalfe, J., Schwartz, B. L., & Joaquim, S. G. (1993). The cue-familiarity heuristic in metacognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19(4), 851–861.
Metcalfe, J., & Shimamura, A. (Eds.). (1994). Metacognition: Knowing about knowing. Cambridge, MA: Bradford Books.


Mitchell, K. J., & Johnson, M. K. (2009). Source monitoring 15 years later: What have we learned from fMRI about the neural mechanisms of source memory? Psychological Bulletin, 135(4), 638–677.
Modirrousta, M., & Fellows, L. K. (2008). Medial prefrontal cortex plays a critical and selective role in "feeling of knowing" meta-memory judgments. Neuropsychologia, 46(12), 2958–2965.
Nelson, T. O. (1984). A comparison of current measures of the accuracy of feeling-of-knowing predictions. Psychological Bulletin, 95(1), 109–133.
Nelson, T. O., Gerler, D., & Narens, L. (1984). Accuracy of feeling-of-knowing judgments for predicting perceptual identification and relearning. Journal of Experimental Psychology: General, 113(2), 282–300.
Nelson, T. O., Leonesio, J., Shimamura, A. P., Landwehr, R. F., & Narens, L. (1982). Overlearning and the feeling of knowing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 8(4), 279–288.
Nelson, T. O., & Narens, L. (1990). Metamemory: A theoretical framework and new findings. In G. H. Bower (Ed.), The psychology of learning and motivation: Advances in research and theory (Vol. 26, pp. 125–173). San Diego, CA: Academic Press.
Norman, D. A., & Wickelgren, W. A. (1969). Strength theory of decision rules and latency in retrieval from short-term memory. Journal of Mathematical Psychology, 6, 192–208.
Perrotin, A., Belleville, S., & Isingrini, M. (2007). Metamemory monitoring in mild cognitive impairment: Evidence of a less accurate episodic feeling-of-knowing. Neuropsychologia, 45(12), 2811–2826.
Reder, L. M. (1988). Strategic control of retrieval strategies. In G. H. Bower (Ed.), The psychology of learning and motivation: Advances in research and theory (Vol. 22, pp. 227–259). San Diego, CA: Academic Press.
Reder, L. M., & Ritter, F. E. (1992). What determines initial feeling of knowing? Familiarity with question terms, not with the answer. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(3), 435–451.
Reggev, N., Zuckerman, M., & Maril, A. (2011). Are all judgments created equal? An fMRI study of semantic and episodic metamemory predictions. Neuropsychologia, 49(5), 1332–1342.
Sass, K., Habel, U., Sachs, O., Huber, W., Gauggel, S., & Kircher, T. (2012). The influence of emotional associations on the neural correlates of semantic priming. Human Brain Mapping, 33(3), 676–694.
Schacter, D. L. (1983). Feeling of knowing in episodic memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 9(1), 39–54.


Schnyer, D. M., Nicholls, L., & Verfaellie, M. (2005). The role of VMPC in metamemorial judgments of content retrievability. Journal of Cognitive Neuroscience, 17(5), 832–846.
Schnyer, D. M., Verfaellie, M., Alexander, M. P., LaFleche, G., Nicholls, L., & Kaszniak, A. W. (2004). A role for right medial prefrontal cortex in accurate feeling-of-knowing judgements: Evidence from patients with lesions to frontal cortex. Neuropsychologia, 42(7), 957–966.
Schwartz, B. L., & Metcalfe, J. (1992). Cue familiarity but not target retrievability enhances feeling-of-knowing judgments. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(5), 1074–1083.
Shimamura, A. P. (2011). Episodic retrieval and the cortical binding of relational activity. Cognitive, Affective & Behavioral Neuroscience, 11(3), 277–291.
Souchay, C., Moulin, C. J. A., Clarys, D., Taconnat, L., & Isingrini, M. (2007). Diminished episodic memory awareness in older adults: Evidence from feeling-of-knowing and recollection. Consciousness and Cognition: An International Journal, 16(4), 769–784.
Thomas, A. K., Bulevich, J. B., & Dubois, S. J. (2011). Context affects feeling-of-knowing accuracy in younger and older adults. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37(1), 96–108.
Thomas, A. K., Bulevich, J. B., & Dubois, S. J. (2012). An analysis of the determinants of the feeling of knowing. Consciousness and Cognition: An International Journal, 21(4), 1681–1694.
Toth, J. P., & Parks, C. M. (2006). Effects of age on estimated familiarity in the process dissociation procedure: The role of noncriterial recollection. Memory & Cognition, 34(3), 527–537.
Tulving, E. (1972). Episodic and semantic memory. In E. Tulving & W. Donaldson (Eds.), Organization of memory (pp. 381–402). New York, NY: Academic Press.
Wickelgren, W. A., & Norman, D. A. (1966). Strength models and serial position in short-term recognition memory. Journal of Mathematical Psychology, 3(2), 316–347.
Yaniv, I., & Meyer, D. E. (1987). Activation and metacognition of inaccessible stored information: Potential bases for incubation effects in problem solving. Journal of Experimental Psychology: Learning, Memory, and Cognition, 13(2), 187–205.
Yonelinas, A. P., & Jacoby, L. L. (1996). Noncriterial recollection: Familiarity as automatic, irrelevant recollection. Consciousness and Cognition: An International Journal, 5, 131–141.

CHAPTER

6

Tip-of-the-tongue States, Déjà vu Experiences, and Other Odd Metamemory Experiences

Bennett L. Schwartz and Anne M. Cleary

Abstract

This chapter discusses several forms of metamemory hiccups—subjective experiences that alert us to potential conflict between our metacognitive state and our memory capabilities at the moment; examples include tip-of-the-tongue states, déjà vu experiences, and blank-in-the-mind states. Blank-in-the-mind states, for instance, occur when we set out to accomplish a task and retain the will to complete it but cannot recall what the task was. This chapter describes these phenomena, the research on their causes and consequences, and why they are important to our understanding of metamemory in general. These experiences can prompt us to attempt to resolve the underlying discrepancies through metacognitive control, such as by directing attention toward information-gathering or retrieval efforts. By alerting us that something is amiss, such experiences act as early-warning systems, allowing us to monitor and control our own mental processes.

Key Words: metamemory, déjà vu, tip-of-the-tongue states, TOT, blank-in-the-mind states

The first author of this chapter recently attended a concert of the violin virtuoso Hilary Hahn. Ms. Hahn played for over two hours without glancing at a page of musical notation. She probably played over 30,000 distinct notes from memory during that time. In addition to each note and how long to hold it (e.g., 8th note, 16th note), she must remember dynamics (loudness), bowing technique, finger position, and the amount of vibrato to use, among other things, for each and every note during each piece of music. Does Ms. Hahn ever have any metamemory hiccups; that is, a point at which she is not sure whether she knows what notes come in the next measure or the correct string to play those notes on, or fails to remember a piece of music but feels that it is in her memory despite being momentarily inaccessible? Watching a talented performer such as Ms. Hahn, one might think she never does, but as metacognition researchers, we are sure she does have such thoughts from time to time. This

chapter is about such metamemory errors or hiccups, instances in which our metacognitive system tells us that we know something when we momentarily do not or that we have been somewhere where we have not. It is our thesis that, like visual illusions, such metamemory errors tell us something about how the metacognitive system works. In turn, the relation of metamemory errors to metacognitive processes also tells us something about the nature of subjective experience. Thus, by studying metamemory hiccups we can make some deductions about the nature of metamemory and its role in conscious behavior. Consider a situation frequently experienced by students at exam time. During the exam, a student is plagued by the sudden inability to remember a fact or idea that was surely known earlier. For the student, this failure to recall a known fact has immediate consequences, such as a lower score on the test. This frustrating experience is often compounded

by the retrieval of the target answer as soon as the exam is handed to the instructor. One can ask two questions about this situation. First, why does the student have the retrieval failure in the first place? That is, what processes allow for the forgetting of learned material? This is a very old question in psychology and will not be addressed in this chapter (but for references on this topic, see Bjork & Bjork, 1992; Schacter, 2007). Importantly, however, this example illustrates that forgetting something that was learned does not mean that the information is gone forever. The target information may still be in memory, and the person is simply failing to retrieve it at the moment, though it may be retrieved later on or accessed if given the right cue to elicit retrieval (Smith & Moynan, 2008). Second, and more important to this chapter, how does the student know that he or she has information in memory that is not currently accessible? This is critical to metamemory—knowing what we know and knowing what we do not know. One of our main points in this chapter is that although metamemory hiccups may be frustrating, they can serve a useful and perhaps adaptive purpose in alerting us to what we know or do not know (Cleary, 2014; Schwartz & Metcalfe, 2011). In this chapter, we focus on several types of metamemory hiccups. Our first consideration will be the tip-of-the-tongue state (henceforth, TOT), which is the name for the experience just described. A TOT is a conscious feeling that a word is in memory even though it is temporarily inaccessible to the person (Schwartz & Metcalfe, 2011; 2014). In most cases, TOTs involve the experience that recall is imminent, and TOTs are also often accompanied by emotion (Schwartz, Travis, Castro, & Smith, 2000). These characteristics distinguish the TOT from a high feeling of knowing, which is often defined as a judgment of future recognition (Schwartz, 2008). For example, a person may be sure that the name of the actor who played Neo in The Matrix resides in her memory, fail to produce the name on demand (Keanu Reeves), yet, when making a feeling-of-knowing judgment, estimate that she would recognize the name if presented with it later. Our second consideration will be the déjà vu experience. A déjà vu experience is the feeling that one has been to a place before, even though one knows one has not (Brown, 2004; Cleary, 2014; Cleary et al., 2012). For example, a person visiting the Louvre in Paris may be plagued with the eerie feeling that he has been there before, but know objectively that this is his first visit to Paris. Déjà

vu experiences are more than just a sense of familiarity; they are accompanied by an experience of having been there before, despite objective knowledge to the contrary (Brown, 2004). In this chapter, we claim that TOTs and déjà vu experiences have much in common. They both tell us that we may have knowledge relevant to the current situation that we are failing to consciously access at the moment. Both are strong subjective experiences, which may persist for some time. Because of their subjective peculiarity, both experiences may also be remembered long after the word has been retrieved or the person has been convinced that he never visited that museum before. But TOTs and déjà vu states are joined by other experiences that also share these features. Other metacognitive hiccups that we discuss in this chapter are blank-in-the-mind states (Efklides, 2014) and the tip-of-the-nose experience (Jönsson & Stevenson, 2014). Our major focus will be on TOTs and déjà vu experiences, as these are the most widely studied of these phenomena, but we also will briefly touch on the others. Our goal is to give a functional account of these phenomena. It is our view that although they may seem like errors or “glitches in the system,” they actually arise from functional, adaptive properties of metacognition and serve a useful purpose. In this chapter we first discuss why we have TOTs, the nature of the underlying processes, and how TOTs actually serve a useful metacognitive function. We then turn to the déjà vu phenomenon and ask the same questions—why we have them, the nature of the processes that underlie them, and how these experiences also serve a useful metacognitive function. Next we discuss less well-studied metamemory experiences: the tip-of-the-nose phenomenon and the blank-in-the-mind state. We then describe a general framework for understanding these types of metamemory hiccups. Finally, we speculate about the nature of mental experience and how metacognitive hiccups inform consciousness.

Tip-of-the-tongue States—Cause or Effect?

Attention to the TOT state in the psychological community goes back to William James (1893), who wrote about both the lack of actual retrieval and the frustrating subjective experience that accompanied it. However, it was Brown and McNeill (1966) who first explored TOTs in an empirical fashion. They gave study participants the definitions of rare words and asked them to generate the target word (e.g., ambergris, cloaca, sampan, and apse). When participants failed to come up with

the target word, they indicated whether or not they were experiencing a TOT. Brown and McNeill later gave participants the target words and asked them if that was the word for which they were experiencing a TOT. They found that participants often had bits and pieces of the missing word, even when they could not recall the whole word. Since then, researchers have used a variety of stimuli to induce TOTs, including rare-word definitions, general-information questions, pictures of objects and animals, faces of famous people, and translation equivalents across languages (Brown, 2012). Diary studies have also been conducted that track real-world TOTs as they occur in everyday life; these studies show that TOTs are quite common, occurring as often as once a day for older adults (Brown, 2012). TOTs are also universal experiences across languages and cultures. In a survey of languages, Schwartz (1999) found that approximately 90% of those questioned expressed the feeling of temporary inaccessibility using the same “tongue” metaphor used in English, even in languages unrelated to Indo-European languages (e.g., Turkish, Amharic, Korean). Moreover, Brennen, Vikan, and Dybdahl (2007) identified a Mayan language, Q’eqchi’, that lacks a specific term for TOTs. Brennen et al. tested Q’eqchi’ speakers in Guatemala, many of whom had limited skills in Spanish. When the Spanish term (“punta de la lengua”) was explained to them, they reported having experienced TOTs many times before in the Q’eqchi’ language. When examining general information retrieval in Q’eqchi’ speakers, Brennen and colleagues found TOT rates among the Q’eqchi’ speakers to be comparable to those among speakers of Western languages. Although Brennen and colleagues (2007) tapped into a population different from the typical college student population, the TOTs that the Mayan speakers reported closely resembled those of English-speaking college students. Interestingly, TOTs also occur among sign-language users. Termed the tip-of-the-finger phenomenon in this case (Thompson, Emmorey, & Gollan, 2005), this experience occurs when a nonhearing person feels that despite not being able to think of the sign for a particular word, it is there in memory nonetheless (at the “tip of the finger”). The tip-of-the-finger experience shares characteristics with the TOT experience, such as occurring for proper names in particular, often being accompanied by partial target access and often leading to eventual target access from memory.

What is the cause of the TOT state? Is it that during the retrieval attempt, one’s spotlight or focus of attention moves onto the representation for the sought-after word in memory and detects it despite failing to actually shine a light on its identity? In this view, known as the direct-access view, the TOT is a direct reflection of an unrecalled word (see Brown, 2012); somehow, the word’s presence in memory is being directly detected without being consciously accessed. Or does the TOT experience emerge from clues and cues that may be available to the person? This view, known as the inferential view, is more common among researchers interested in metacognition (Schwartz & Metcalfe, 2011). The primary question at issue is whether the unretrieved word itself directly causes the TOT, or whether the TOT is an emergent property of other clues and pieces of information that are available. Consider the following questions: Who is the current prime minister of Australia? What is the last name of the first person to land on the moon? What is the name of the actress who plays the character “Veronica Mars” in the TV show and movie of the same name? Perhaps one of these questions may induce a TOT for you. If you have a TOT experience, notice what happens to you. You may get agitated. You may reach for the nearest device that connects to the Internet to look up the answer. You may also be able to recall some information—the name of the first woman who was prime minister of Australia, the famous quote from the first person on the moon, and the names of other performers in the Veronica Mars television program. Indeed, researchers have been documenting such partial information from the beginning of TOT research (e.g., Brown & McNeill, 1966). But what does all this suggest about the TOT experience? There are now data that support the idea that TOTs are the product of a heuristic mechanism that relies on cues and clues (see Schwartz, 2006; Schwartz & Metcalfe, 2011; 2014). This inferential view is counterintuitive, as the TOT feels to us as though the target is right there, ready to be retrieved. The feeling is one of target access, but the cause of this feeling may be the combined effect of the familiarity of the cue or cues available and the ability to retrieve partial or related information. Evidence that cue familiarity plays a role was shown by Metcalfe, Schwartz, and Joaquim (1993), who demonstrated that more familiar cues led to more TOTs for paired associates, irrespective of the level of recall. That is, the TOTs tracked the familiarity of the cue, rather than the memorability of the target. Evidence that the ability to retrieve partial

or related information plays a role has been shown in other studies. For example, Schwartz and Smith (1997) showed that the retrieval of information related to the target word increased the likelihood of a TOT. In their study, participants studied fictional animals (see Figure 6.1). If the participants encoded biographical information about the animal, they were more likely to have a TOT for the animal’s name, even though the names were not more memorable in those conditions. Thus, access to related information fosters more TOTs. Finally, evidence suggests that people rely on other clues to infer the presence of a target in memory. For example, being in a TOT previously informs one’s current state. Warriner and Humphreys (2008) showed that having recently been in a TOT predicted a higher likelihood of being in a TOT again. It is likely that people use the remembered TOT as a cue for being in a TOT now, in much the same way as being able to retrieve something now informs the likelihood of future retrieval (see Kornell, 2012). Together, these data suggest that TOTs are not the direct result of access to an unretrieved target but are an attribution one makes based on the currently available information. If TOT states indeed result from inferences based on the combined hints and clues currently available, then one might expect TOTs to be associated with other attributions or inferences as well. That is, if

the combined available clues lead one to infer the presence of a target word in memory, might the same clues lead one to make other types of inferences? Evidence suggests that the answer is yes. As reviewed by Cleary, Staley, and Klein (2014), the presence of a reported TOT state is associated with the judgment of an increased likelihood that an inaccessible word was presented recently, regardless of whether it actually was. An example is a study by Cleary (2006), which showed that being in a TOT state for the answer to a general-knowledge question was associated with an increased judged likelihood that the unretrieved answer was presented earlier in the experiment, even when it was not. Similarly, Cleary and Specker (2007) showed that being in a TOT state for a celebrity name was associated with an increased judged likelihood that the name was presented earlier in the experiment, regardless of whether it actually was, and Cleary, Konkel, Nomi, and McCabe (2010) found that being in a TOT state for the name of an odor (ginger, coffee, lavender) predicted that participants would indicate having either smelled or seen the name of the odor earlier in the experimental session, regardless of whether they actually had. These patterns suggest that TOT states are associated with other inferences besides the inference that a word is in memory. From the heuristic-attribution perspective, it makes sense that, if the TOT state itself is an inference that

Figure 6.1  An example of a TOTimal. Drawing by Steven M. Smith. When more related information is provided, there are more TOTs for the target animal’s name (Schwartz & Smith, 1997).

a word is in memory, other inferences might also be based on the same available information. In this case, the inference may be something along the lines of, “if I have a TOT for an item, I must have experienced it recently.” Also consistent with the inferential approach is the observation that being in a TOT state for one item can affect whether or not participants experience TOTs for another item. Schwartz (2011) showed that if a previous item induced a TOT state, participants were less likely to report a TOT for the next item. In this case, the feeling of a TOT inhibits subsequent TOTs, although the feeling of the TOT does not inhibit subsequent recall. Here, the attribution may be something along the lines of, “if I am in a TOT for item 1, I cannot be for item 2,” consistent with the inferential account. Schwartz and Metcalfe (2011) provide a detailed review to support the heuristic-metacognitive account of TOTs, namely the view that TOTs are metacognitive experiences based on heuristic inferences rather than a form of direct access to an otherwise inaccessible target (see Figure 6.2). They claim that TOTs occur for unrecalled items when

there is a sufficient buildup of cue familiarity, partial retrieval, and retrieval of related information. In this case, cue familiarity means whether or not the participant recognizes the cue as being shown earlier in the study (in episodic memory) or as eliciting some expertise (in semantic memory). Thus, for example, when asked if one knows the name of the first person on the moon, familiarity with the history of the Apollo program would be considered cue familiarity because the familiarity is about terms in the cue rather than the target answer. When we retrieve no partial and related information and have no familiarity with the cue, we will not have a TOT. However, when we do experience familiarity with the cue(s) and/or retrieve partial and related information, there is a criterion point which, if exceeded, will trigger the state. They also argue that this suggests a function or useful purpose for TOTs. Specifically, TOTs are warning lights that alert us to the possibility that a target can be retrieved. We use the available familiarity and partial and related information as the trigger for this warning light, and then the TOT pushes us to greater mental effort to retrieve the target. More specifically, the experience of a TOT state may prompt us to continue to search for the desired information.

Figure 6.2  A model of how TOTs are formed. The cue is encoded and retrieval is attempted; successful retrieval outputs the answer. If retrieval fails, information from all available sources (cue-based information, partial semantic information, syntactic information, phonemic information, and related information) is integrated. If the amount of information exceeds the TOT criterion, a TOT state is experienced; otherwise, the search is given up. The availability of partial and related information thus influences the likelihood that a TOT will occur.
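To make the criterion logic of this account concrete, the following minimal sketch simulates the decision flow summarized in Figure 6.2. It is our own illustration rather than a published implementation: the evidence values, the integration rule (a simple sum), the criterion value, and every name in the code are hypothetical, and we collapse the figure's separate semantic, syntactic, and phonemic sources into a single partial-information term.

    from dataclasses import dataclass

    # Illustrative simulation of the heuristic-criterion account of TOTs
    # (after Figure 6.2). All values, weights, and the criterion are hypothetical.

    @dataclass
    class RetrievalAttempt:
        retrieved_target: bool   # did full retrieval of the target succeed?
        cue_familiarity: float   # 0-1: how familiar the cue itself feels
        partial_info: float      # 0-1: partial target information (e.g., first letter)
        related_info: float      # 0-1: related information retrieved

    TOT_CRITERION = 1.2          # hypothetical threshold on the summed evidence

    def metamemory_outcome(attempt: RetrievalAttempt) -> str:
        """Return the modeled outcome: output the answer, a TOT state, or giving up."""
        if attempt.retrieved_target:
            return "output answer"
        # When retrieval fails, integrate information from all available sources.
        evidence = attempt.cue_familiarity + attempt.partial_info + attempt.related_info
        # A TOT is experienced only if the integrated evidence exceeds the criterion.
        return "experience TOT state" if evidence > TOT_CRITERION else "give up"

    # A familiar cue plus partial access yields a TOT; an unfamiliar cue does not.
    print(metamemory_outcome(RetrievalAttempt(False, 0.9, 0.4, 0.3)))  # experience TOT state
    print(metamemory_outcome(RetrievalAttempt(False, 0.1, 0.0, 0.1)))  # give up

On this toy formulation, anything that increases cue familiarity or the retrieval of partial and related information raises the integrated evidence and thus the chance of crossing the criterion, mirroring the findings reviewed above.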

Given that all TOT studies that correlate TOTs with subsequent memory performance find a strong correlation between TOTs and eventual retrieval, this system may be an adaptive feature of human cognitive processes in this regard. As such, we have argued that TOTs, like other metacognitive judgments, serve an active, useful function (Schwartz & Metcalfe, 2014). Our experiential feelings are markers that inform us that a particular task is possible, that a particular item is memorable, or that a particular word is retrievable. These markers can then alter our behavior through control processes. In support of this view of TOT states as both adaptive and useful, there is an accumulation of evidence suggesting that TOTs are related to increased cognitive effort toward retrieving a yet-unretrieved item. For instance, it is possible that the TOT drives people to spend more time attempting retrieval than if one is not in a TOT. To support this view, Schwartz (2001) asked participants to answer general-information questions. He measured the amount of time participants spent retrieving items and then correlated that with subsequent TOTs. He found that participants spent more time trying to resolve an unrecalled target if a TOT had occurred than if they were in a “don’t know” state. More recently, Schwartz and Metcalfe (2013) showed that participants in TOTs for the answers to general-information questions were more likely than those in don’t-know states to seek out or request the target answer when only a limited number of such requests were possible (see Figure 6.3).

Moreover, Litman, Hutchins, and Russon (2005) found that when people experienced TOTs, they also experienced a drive to recall the item rather than “look it up” or have someone tell them the word. They argued that TOTs induce a sense of curiosity that may be beneficial for recall. TOTs provide us with insight into an otherwise opaque retrieval system and signal that further effort is warranted. Thus, from the metacognitive perspective, TOTs are not simply a marker of failed retrieval but also a prod to future successful recall. One important caveat must be mentioned, however, in assessing the data described in this paragraph: all of the data in these studies are correlational. Thus, it is possible, for example, that a longer retrieval time is used as a cue as to whether a TOT will be experienced or not. It may also be that the desire to seek answers is not a result of the TOT, but rather a result of other processes. We hope that future studies will tease these possibilities apart. It is also the case that TOTs require or demand attention. Several studies show that the presence of TOTs interferes with other cognitive tasks. Ryan, Petty, and Wenzlaff (1982) asked participants to retrieve word definitions. If they failed to retrieve the word, they were asked if they were in a TOT, and then immediately engaged in a concurrent number-probe task. They were presented with a series of numbers and were required to indicate every time they saw a particular number. When participants were experiencing a TOT for the word definition, they made more errors on the subsequent number-probe task. Similarly, Schwartz (2008) examined the effect of dual-task performance on TOT rates.

Figure 6.3  Results from Schwartz and Metcalfe (2013), plotting the percentage of participants seeking the answer (0 to 45 percent) for items in TOT versus non-TOT (N_TOT) states. When people were given limited opportunities to seek out unrecalled answers, they chose answers for which they had TOTs during the retrieval stage.

Participants were required to maintain a digit span while they answered general-information questions and were queried about TOTs. Schwartz found that working memory performance decreased during TOTs relative to non-TOTs for general-information questions. Moreover, Schwartz found that being in a TOT interfered with one’s ability to do the working memory task. Digit-span performance was significantly worse when participants were in TOTs than when they were not. These two studies suggest that participants are allocating attention and resources to resolving TOTs, which causes performance on the secondary task to suffer. If TOTs require attention and if they interfere with other forms of cognition, it is likely that they must also be directing our behavior in particular ways (see Kuhlman & Bayen, this volume). These studies seem to suggest that TOTs alter the way in which we attempt retrieval, alerting us to temporary inaccessibility, which allows us to devote more cognitive resources to resolving those TOTs. Thus, the bottom line is that TOTs, rather than being an annoying nuisance, are an adaptive feature of the human cognitive system, a feeling that informs us about the potential to remember. We can use this information to guide our decision making regarding how to proceed, whether it is deciding to continue to search for more information externally (such as by searching the Internet) or to devote cognitive resources internally to further attempts to retrieve the desired information from memory. Thus, future TOT research might examine how TOT states impact people’s judgments and decisions in various types of decision-making situations. Specifically, how do people’s decisions about how to proceed in various situations change as a function of whether or not a TOT state is present? TOTs are metacognitive experiences about language production. Thus, another future direction for TOT research is to more closely link the tradition of TOT research from metacognition with the research using TOTs as markers of failed lexical retrieval (see Brown, 2012; Warriner & Humphreys, 2008). For example, metacognition researchers tend to think of partial and related information as driving TOTs (e.g., Schwartz & Metcalfe, 2011), whereas language-production researchers tend to think of partial and related information as being caused by the same processes that cause the TOT, namely the process of lexical retrieval. Disentangling these issues will be critical to our understanding of TOTs. Moreover, in the real world, TOTs often occur in a social situation. For example, people

may be discussing a movie they just saw together and then simultaneously experience a TOT for the name of the actress who played a particular part. Understanding the social dynamics of TOTs has yet to be addressed.

The Déjà vu Experience

The déjà vu experience is a surprising one in which a new situation feels as though it has been experienced before (Brown, 2003, 2004). For example, when visiting the Louvre for the first time, you experience the feeling that you have been there before, despite knowing that it is your first time in Paris. From where does this feeling of déjà vu come? Why does it occur? We suggest that, as with TOTs, the experience of déjà vu may have a useful, adaptive purpose as well. It may be, for example, that even though you have never been to the Louvre before, the feeling of déjà vu is alerting you to the fact that something relevant exists in your memory. Perhaps you saw scenes from the Louvre in the movie The Da Vinci Code years ago and are simply failing to recall that as the source of your current familiarity with the Louvre. Thus, you accurately recognize the scene as being familiar but are failing to correctly attribute the source of that familiarity. Brown and Marsh (2010) provide a humorous example of this by describing a hotels.com commercial in which a man enters his hotel room for the first time upon checking in and is struck by an awe-inspiring feeling of déjà vu. Wide-eyed and gasping, he exclaims to his wife, “I’ve been in this room before!” She asks, “Huh?” He exclaims, “I’ve been here, before!” She extinguishes the mystery and excitement by matter-of-factly stating, “Yeah, you did the virtual tour on hotels.com.” In this case, like the TOT, the déjà vu experience arises out of a particular memory process—familiarity—brought on by the surroundings. When familiarity occurs alongside the failure to actually retrieve the prior experience that is responsible for the familiarity, one is left only with the mysterious sense of familiarity. Like the TOT experience, this sense may serve an adaptive and useful purpose: It may prompt a person to devote more cognitive resources to attempting to retrieve whatever potentially relevant information may be residing in memory that produced the initial sense of familiarity. Until quite recently, there were no good experimental models of the déjà vu experience. However, Cleary and her colleagues have developed a method to induce déjà vu experiences reliably in a laboratory setting (Cleary, Ryals, & Nomi, 2009; Cleary

et al., 2012). The studies employ a variation of the recognition without cued-recall method (Ryals & Cleary, 2012). In this method, participants study a list of words (e.g., forehead, disruption) and are then presented with a test list containing some items that resemble studied items (e.g., foneheed, disfraption). For each test cue, participants attempt to recall a word from the study list that resembles it. Even when they cannot recall one, they make a judgment of familiarity on the test cue. The key aspect of this paradigm is that when recall fails, familiarity ratings are higher for cues that resemble studied words than for cues that do not resemble studied words. Thus, otherwise novel test cues evoke a sense of familiarity because of their similarity to earlier words, even though those earlier words are not accessible. Cleary et al. (2012) make the case that this methodological approach is ideally suited for investigating déjà vu experiences in the laboratory. Cleary et al. (2012) used a virtual reality system to allow participants to be immersed within novel scenes in color and in 3D. They used this system to test a particular hypothesis about déjà vu known as the Gestalt familiarity hypothesis (Brown & Marsh, 2010), according to which déjà vu can result from familiarity brought on by the spatial arrangement of elements within an otherwise novel scene. For example, a person may experience déjà vu upon entering the church for a friend’s wedding. The source of the familiarity might actually be that the layout of the inside of the church strongly resembles the layout of a courthouse recently visited when the individual was a juror. The déjà vu is precipitated by the similarity between the two scenes. To investigate this idea, Cleary et al. (2012) used a virtual reality system with a head-mounted display that allowed

participants to see stereoscopic cues in the scene and to view the inside of each scene by moving their heads to look around. This enabled a feeling of being immersed within the space (see Figure 6.4), thus allowing for a verisimilitude not normally seen in memory experiments. Participants visually explored 16 such scenes, all created using the Sims game engine, during each learning phase of the experiment (32 altogether). To manipulate scene familiarity at test, half of the test scenes each had a spatial configuration similar to that of a particular scene shown earlier but were otherwise novel. Thus, if participants originally saw a courtyard scene in the initial learning phase, they might later see a museum scene that, although new, has the same spatial configuration as the courtyard scene (see Figure 6.5). Thus, Cleary et al. (2012) could determine if the familiarity induced by the similar spatial configuration would induce a déjà vu experience in the absence of recalling the specific, previously viewed scene responsible for the familiarity. For example, perhaps the museum scene on the right-hand side of Figure 6.5 would seem familiar despite an inability to recall the earlier-viewed courtyard scene (on the left-hand side of Figure 6.5) as a scene with the same spatial configuration (and the source of the familiarity). The results were striking. First, in support of the Gestalt familiarity hypothesis, the probability of reporting a déjà vu experience increased when an otherwise novel test scene spatially mapped onto a previously viewed scene that failed to be recalled. Second, in another experiment it was shown that the likelihood of a déjà vu experience increased as the degree of match between the test scene and one in memory increased. Interestingly, the probability

Figure 6.4  The head-mounted display and virtual reality system used by Cleary et al. (2012).

Figure 6.5  Examples of two scenes that map onto one another in their configuration of elements: the left scene is a courtyard; the right is a museum scene.

of déjà vu was greatest when a test scene was identical to one viewed previously but failed to be recognized as such (i.e., was falsely identified as a novel, never-before-viewed scene). For example, if the participant had previously viewed the courtyard scene on the left-hand side of Figure 6.5, but failed to remember this when placed within the same courtyard scene later in the experiment, the probability of reporting déjà vu in this situation was even higher than when a test scene was novel but merely spatially resembled that previously viewed scene (as in the right-hand side of Figure 6.5). In both experiments, scene familiarity ratings tracked the déjà vu probabilities in the different conditions, suggesting that déjà vu was related to the level of familiarity that the test scene elicited. The results of this second experiment, in particular, are interesting in light of the fact that it has typically been assumed that déjà vu experiences are simply errors, an odd quirk caused by spurious factors. For example, the déjà vu experiences that epilepsy sufferers may get just prior to a seizure are probably due to abnormal stimulation of the temporal lobe areas involved in recognition memory (O’Connor & Moulin, 2008), and some have suggested that such spurious brain activity may underlie déjà vu in non-epileptics as well (e.g., O’Connor & Moulin, 2008). However, the study by Cleary et al. (2012) not only supports memory explanations of déjà vu in normal individuals by supporting the Gestalt familiarity hypothesis, but it also hints at a reason that the déjà vu experience, like the TOT experience, may be useful and adaptive. Specifically, that the probability of a déjà vu experience increases with an increasing match of a seemingly novel situation to one in memory suggests that the experience may indeed serve to alert a person to the fact that there is something relevant in memory.
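The logic of the Gestalt familiarity hypothesis can be illustrated with a small sketch, again ours and not the stimuli or analysis of Cleary et al. (2012), in which scenes are reduced to coarse occupancy grids and déjà vu is reported when a test scene's configuration closely matches a stored scene that cannot itself be recalled. The grids, the similarity measure, and the criterion are all hypothetical simplifications.

    # Illustrative sketch of the Gestalt familiarity hypothesis: déjà vu is
    # modeled as high configural similarity to a stored scene in the absence
    # of successful recall. Scenes, grids, and the criterion are hypothetical.

    # A scene is a coarse 3x3 grid marking where elements sit (1) or do not (0).
    # Element identities (hedge, statue, display case) are deliberately ignored,
    # because only the spatial configuration matters on this hypothesis.
    courtyard = [[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 1]]
    museum = [[1, 0, 1],      # same layout as the courtyard, different content
              [0, 1, 0],
              [1, 0, 1]]
    office = [[0, 1, 0],      # a configuration unlike anything in memory
              [1, 1, 1],
              [0, 0, 0]]

    memory_store = {"courtyard": courtyard}

    def config_similarity(a, b):
        """Proportion of grid cells whose occupancy matches between two scenes."""
        cells = [(r, c) for r in range(3) for c in range(3)]
        return sum(a[r][c] == b[r][c] for r, c in cells) / len(cells)

    def judge_scene(test_scene, recall_succeeds=False, deja_vu_criterion=0.8):
        """Report déjà vu when configural familiarity is high but recall fails."""
        best_match = max(config_similarity(test_scene, s) for s in memory_store.values())
        if recall_succeeds:
            return "source of familiarity recalled"
        return "deja vu" if best_match >= deja_vu_criterion else "feels new"

    print(judge_scene(museum))  # deja vu: layout matches the unrecalled courtyard
    print(judge_scene(office))  # feels new: no stored configuration matches

The sketch also captures the second result described above: a test scene identical to a stored but unrecalled scene produces the maximum configural match and therefore the strongest push toward a déjà vu report.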

More specifically, when participants in the study by Cleary et al. (2012) failed to recognize that they had actually previously been in a scene, that is when they were the most likely to report experiencing déjà vu with that scene. Thus, déjà vu may sometimes occur because we falsely believe a situation to be new when it is in fact not new. In such situations, it may be particularly useful to have the experience alert us to that possibility, as the déjà vu experience may prompt us to devote cognitive resources toward attempting to retrieve a relevant prior experience. However, even in cases in which the situation prompting déjà vu is new, the reason for the déjà vu experience may be that something relevant exists in memory nonetheless, as when a person had previously viewed his hotel room online but forgot about that, or when something similar to the current situation exists in memory but fails to be recalled. In any of these cases, the déjà vu experience may serve the useful purpose of prompting the devotion of more cognitive resources toward trying to figure out what potentially relevant thing may be residing in memory. In short, with déjà vu, something is familiar, but more conscious deliberative systems cannot detect where that familiarity comes from. Thus, the déjà vu experience pushes the individual to inspect more, to seek out more answers in order to resolve the discrepancy. That is, a déjà vu experience may urge us to consider the source of the strange feeling and thus retrieve or discover the actual experience that triggered the déjà vu. In the experiment described, the museum scene with the statue feels strangely familiar because it is similar to the courtyard visited earlier, and similarity to a prior experience is a factor known to contribute to familiarity (e.g., Ryals & Cleary, 2012). Without the déjà vu experience, we might not work as hard to determine if something

relevant to the current situation resides in our memory. However, because laboratory investigations of déjà vu are still in their infancy, studies on whether déjà vu experiences change search behavior remain to be conducted. A useful future direction for déjà vu research will be to investigate whether analogous findings to those shown with TOT experiences can be found with regard to search effort and the devotion of cognitive resources during déjà vu experiences. For example, do people tend to devote more retrieval time toward a recall attempt during déjà vu experiences than when déjà vu is not occurring? Do people selectively choose to have answers given to them during déjà vu experiences? Do déjà vu experiences take cognitive resources away from other tasks? These are all worthwhile directions for future research on déjà vu experiences.

Other Unusual Metacognitive Experiences

Déjà vu experiences may be the most famous, and TOTs the most well-studied, but there are other interesting subjective experiences that are linked to metamemory. For example, closely related to the TOT is the tip-of-the-nose phenomenon, which is the experience in which we are highly familiar with a smell but cannot name it. All it takes to experience the tip-of-the-nose phenomenon is to have someone shield the labels of the bottles in your spice rack and have you sniff a few. Many of the odors will be familiar, but you may have a hard time identifying them by name. Certainly one will evoke a tip-of-the-nose experience. Another interesting subjective experience linked to metamemory is the blank-in-the-mind phenomenon. Blank-in-the-mind states refer to the experience one gets when one is sure that one was about to do something, but that something has been forgotten; in essence, it is the experience of a prospective memory failure (Efklides, 2014). For example, you may run upstairs to get something, but by the time you are there, you have forgotten what you were planning on getting. Once you return to your office downstairs and start working on your computer, you may realize that it was your reading glasses that you needed but did not get. On your second trip up the stairs, you may focus a bit more on the task at hand. We consider each of these in turn.

Tip-of-the-nose phenomenon.

When experimenters give participants a sample of common odors (e.g., peppermint or cinnamon) to name, the participants will seldom get more than 50% of the names correct, and for unfamiliar odors

the percentage correct may be far less (Jönsson & Stevenson, 2014). Nonetheless, participants will reliably report that many of these odors are familiar and may feel that they know the name of the odor. Why are odors so good at inducing familiarity, but so bad as a cue for the name of the source of the smell? Odor detection serves as an early warning system (Herz, 2005). Detecting odors leads us to detect toxins, detect predators, and ensure that our food is edible and our air is breathable. In this way, it has been argued that odor identification is not the primary function of smell (Köster, 2002). That is, odor naming is largely irrelevant to the function of smelling. Odors are perceptually salient stimuli, but ones with weaker connections between their sensory attributes and the semantic-verbal system that maintains the names (e.g., cinnamon, coffee, parmesan cheese, etc.). The goal of our olfactory system is to warn us away from toxins and lead us toward food—the semantic content of odors is less important. The tip-of-the-nose phenomenon was first studied empirically by Lawless and Engen (1977). They asked participants to smell a variety of odors and name them. When participants could not come up with the name of the odor, they often claimed to know the source of the odor, even though, unlike a TOT, they were unable to generate partial information about the name of the odor. In another experiment, Lawless and Engen asked participants if the smell of an odor elicited a TOT for the name of the odor. TOTs were common, although again, there was little or no partial information retrieval during these TOTs for odor names. This finding was confirmed by Jönsson et al. (2005), who found that for familiar but unnamed faces, participants were able to generate much information, but for familiar but unnamed odors, they generated little partial or related information (also see Jönsson & Stevenson, 2014). In a study by Jönsson and Olsson (2003), participants attempted to name a number of odors. If the participants were unable to name the odor, they gave confidence judgments concerning the recognition of the correct odor name. Participants’ confidence was positively correlated with their actual ability to recognize the name of the odor when given a list of alternatives later in the study. Thus, the tip-of-the-nose experience leads to accurate judgments of future performance. In a neuroimaging study, Yeshurun and Sobel (2010) asked participants to smell an odor and indicate whether they

were having a TOT state for the name if they could not recall it. They found that TOTs were predicted by activity in the olfactory cortex, rather than areas associated with language production, confirming that the tip-of-the-nose experience arises from olfactory familiarity and not from partial verbal information. That is, the tip-of-the-nose feeling is associated with the experience of the familiarity of the odor and not with issues related to the failed production of the target word. This fits with other findings reported by Cleary et al. (2010), who extended the recognition without identification phenomenon (the ability to discriminate between old and new items when the items themselves cannot be identified) to odors. They found evidence of odor recognition without identification (an ability to discriminate old from new odors when the odors themselves could not be identified), but only when the unidentified odors had actually been smelled previously at study. Though they could discriminate between unidentified odors that were smelled versus unsmelled before, participants could not discriminate between unidentified odors whose names were studied and unidentified odors whose names were not. Cleary et al. (2010) suggested that there might be something special about olfaction such that semantic or verbal information does not play as large a role in familiarity with it as with other types of information. In short, a convergence of evidence suggests that metamemory feelings related to olfaction may be distinct from other similar metamemory processes, such as the more standard TOT. That said, like déjà vu and TOTs, tip-of-the-nose experiences arise from misplaced familiarity (see Jönsson & Stevenson, 2014). We recognize an odor as being familiar, and because it is familiar, we infer that we ought to be able to name the odor, leading to the characteristic tip-of-the-nose experience. However, in this case, the familiarity arises in one system (the olfactory system) and is then applied to a second system (the naming or semantic system). Although tip-of-the-nose experiences are correlated with eventual recognition of the name of the odor, partial information retrieval is weak compared to TOTs for verbal-verbal associations. Nonetheless, we suspect that tip-of-the-nose experiences also serve a function—alerting us to the fact that an odor is known and familiar in one way or another, thus pushing us to search our kitchen shelves to see if we can find a match. As mentioned previously regarding the universality of the TOT experience, there is also an interesting related phenomenon known as the tip-of-the-finger

experience (Thompson et al., 2005), which occurs when sign-language speakers cannot recall the hand configuration for a known word. As in spoken language TOTs, tip-of-the-finger experiences are associated with a greater retrieval of partial target information, such as the location the hand should be in when the sign is made and the first letter of the word in finger-spelling (Thompson et al., 2005). To date, the research on the tip-of-the-finger experience has concerned issues in language production, but we suspect that it would be interesting to examine the tip-of-the-finger experience from a metacognitive perspective as well.

Blank-in-the-mind states.

Blank-in-the-mind states are the subjective experience of knowing that you were supposed to be doing something but not being able to recall what it is (Efklides, 2014). Consider the person who needs to fill her prescription at the pharmacy. In order to do so, she must drive there. She gets in her car and pulls out of her driveway, but now cannot recall what her intended destination was. She may not recall what her chore was until she later sees her empty prescription bottle on the table. Efklides (2014) finds that people recognize these states as occurring to them and are usually able to describe a recent blank-in-the-mind state as well. Blank-in-the-mind experiences arise from a failure of prospective memory. We intend to perform a particular action, but then we forget what that action is. However, the experience of the blank-in-the-mind state may, in fact, be adaptive. If we forget our intended action but do not have a blank-in-the-mind experience, we may not know that we forgot an important action at all. It is the blank-in-the-mind experience that alerts us to our forgotten intention. Thus, as odd as it may sound, the blank-in-the-mind experience may have a metacognitive function as well—it informs us about forgotten intentions. If this view is correct, then variables that cause us to forget intended actions will also increase the rate of blank-in-the-mind experiences. This is exactly what Efklides and Touroutoglou (2010) observed in their study on blank-in-the-mind experiences. In the study, participants first read a reading-comprehension text. Following the reading task, participants engaged in an arithmetic task. The participants were told that when they saw a particular number they were supposed to hit a key on the computer keyboard in order to answer a question about the previously read text. This served as the

prospective memory task. After participants completed the task, they were asked a number of questions, including one about whether they had a blank-in-the-mind experience. A number of participants reported having blank-in-the-mind experiences for what they were supposed to do during the prospective memory task. Interestingly, Efklides and Touroutoglou found that the more metacognitive monitoring the tasks required of the participants, the more blank-in-the-mind states they experienced. In another experiment reported by Efklides (2014), participants were required to engage in a primary task, which was answering general-information questions. However, the task was interrupted at unpredictable intervals. Each interruption meant doing a secondary arithmetic task, which participants were to perform until they received a particular cue, which would signal them to press either a key to return to the primary task or another key to continue with the secondary task. The cue occurred either 2 seconds or 6 seconds after an arithmetic problem was presented. Thus, in the 6-second condition, the participant had already spent more time and devoted more working memory to the secondary task. Efklides found that more blank-in-the-mind experiences occurred for the 6-second lag compared to the 2-second lag, consistent with the idea that blank-in-the-mind experiences occur during more demanding prospective memory tasks. Radvansky, Tamplin, and Krawietz (2010) examined an analog of the blank-in-the-mind state, although participants were not specifically asked to report on it. In their study, participants negotiated virtual environments. In one situation, they were to leave one room and move to another room to accomplish a secondary task. They were more likely to forget the secondary task when they moved into the other room than when the task could be accomplished in the same room. Radvansky et al. interpret this in terms of encoding specificity; changing the room results in a different set of contextual cues than were originally present when the task was decided on, and in the absence of these cues, the memory for what needs to be done is more likely to fail to be triggered. For our purposes, however, this points to another potential cause of the blank-in-the-mind state, namely that the change of retrieval cues that takes place when changing rooms (or otherwise changing contexts) prevents an individual from recalling the task at hand that sent them there in the first place. In such cases, one may be left with the thought, “What did I come in here for?”

Thus, it appears that the blank-in-the-mind experience, like TOTs and the déjà vu experience, is a metacognitive experience about an aspect of cognition. Blank-in-the-mind experiences inform us when we have forgotten a prospective memory task. The participants in Efklides’ (2014) experiment know that they have to do something; they just do not recall what it is. If we do not have a blank-in-the-mind experience, we do not know that we have forgotten something and may therefore make a critical error. If we do have a blank-in-the-mind experience, we may find ways of resolving the forgetting, thus allowing us to accomplish an otherwise forgotten task.

Why Are Subjective Experiences Critical?

Unlike machines, humans (and presumably other mammals) behave based on thoughts and feelings rather than through reflexive feedback loops that involve no awareness. In essence, the way our neural systems represent themselves to human consciousness is through salient experience. These may be philosophically loaded statements, but at an experiential level they are straightforward and intuitive. This view is obvious in the case of sensations. We move our heads to look at beautiful rainbows, not toward stimuli of varying electromagnetic frequency, and move our fingers to avoid the intense heat of the stove, not away from fast-moving molecules. In these examples, it is the subjective experience of the visual beauty of the rainbow and the anticipation of the subjective experience of searing pain from high heat that we think of as the cause of our action. We argue that metamemory experience has an analogous role to play in human cognition, in that it is the subjective experience that drives our action. For example, the confusion caused by a blank-in-the-mind state may cause us to direct our behavior toward discovering the contents of the missing thoughts. Forgetting to do something may occur periodically, but it is the blank-in-the-mind state that tells us it is occurring now. In this way, the study of metamemory may indeed be akin to psychophysics. Just as psychophysics strives to find reliable relations between physical energy, such as the wavelength of light, and the subjective experience of color, so should metacognition research strive to find reliable relations between cognitive processes and the subjective experiences that they evoke. For example, we study the strength of a memory and its relation to confidence (or a judgment of learning). And relevant to the

current discussion, we study the accessibility of an item and its correlation with TOT experiences. Thus, the general model we advance here is that we have a set of processes that are designed to elicit metacognitive experiences under a variety of circumstances. For example, when we have a strong feeling of familiarity for a novel place that is not accompanied by the recall of details regarding why it seems familiar, the déjà vu experience results. These experiences inform our conscious selves of discrepancies—between familiarity and recall or between action and intention, in the case of the blank-in-the-mind state. Without these experiences, we would not know that something was amiss. Thus, these experiences act as early-warning systems, allowing us to perceive that all is not right in our cognitive universes. Early-warning systems have a function—when we know something is amiss we can do something about it. Thus, these metamemory experiences allow us to control future cognitive processing, as when a TOT pushes us to seek out the correct answer.
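One standard way to quantify such relations in metamemory research is the Goodman-Kruskal gamma correlation between a subjective report and a memory outcome. The following minimal sketch applies it to TOT reports and later recognition; the data and all names in the code are made up purely for illustration.

    # Illustrative computation of the Goodman-Kruskal gamma correlation, a
    # measure commonly used in metamemory research to relate a metacognitive
    # report (here, a TOT report: 1/0) to memory performance (here, later
    # recognition: 1/0). The data below are hypothetical.

    def gamma(judgments, outcomes):
        """Goodman-Kruskal gamma over all item pairs: (C - D) / (C + D).
        Assumes at least one pair is untied on both variables."""
        concordant = discordant = 0
        n = len(judgments)
        for i in range(n):
            for j in range(i + 1, n):
                dj = judgments[i] - judgments[j]
                do = outcomes[i] - outcomes[j]
                if dj * do > 0:
                    concordant += 1
                elif dj * do < 0:
                    discordant += 1
        return (concordant - discordant) / (concordant + discordant)

    # Hypothetical data: TOT reports for unrecalled items and later recognition.
    tot_reports = [1, 1, 0, 1, 0, 0, 1, 0]
    recognized  = [1, 1, 0, 0, 0, 1, 1, 0]
    print(f"gamma = {gamma(tot_reports, recognized):+.2f}")  # +0.80

A positive gamma of this kind is the quantitative counterpart of the claim that TOTs predict eventual retrieval or recognition.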

Concluding Thoughts

In this chapter, we have reviewed a number of subjective experiences that we advance as serving a metacognitive function. TOTs are strong subjective experiences that inform us that an item, currently unrecalled, may well be retrieved soon. Déjà vu experiences alert us to a discrepancy between our feeling of familiarity and our knowledge of novelty. In both cases, the experience can drive us to alter our behavior. If you have a TOT, you might attempt to cue your memory by going through the alphabet. If you are experiencing déjà vu, you may look for clues as to why the situation feels familiar, a search you might not undertake without the feeling. Similarly, a blank-in-the-mind experience alerts you to the possibility that you have forgotten to do a task. If this task is critical, such as turning off the stove or taking heart medicine, that blank-in-the-mind experience might just save your life. In the case of the concert musician who has memorized hours of music to perform, we take our hats off, too. To recall—at that level—for that long without any obvious metacognitive hiccups is truly a remarkable feat.


CHAPTER 7

Sources of Bias in Judgment and Decision Making

Joe W. Tidwell, Daniel Buttaccio, Jeffery S. Chrabaszcz, Michael R. Dougherty, and Rick P. Thomas

Abstract

Sources of bias in confidence and probability judgments, for example, conservatism, overconfidence, and subadditivity, are some of the most important and rigorously researched topics within judgment and decision making. However, despite the seemingly obvious importance of memory processes for these types of judgments, much of this research has focused on external factors independent of memory processes, such as the effects of various types of elicitation format. In this chapter, we review the research relevant to commonly observed effects related to confidence and probability judgment, and then provide a memory-process account of these phenomena based on two models: Minerva-DM, a multiple-trace memory model; and HyGene, an extension of Minerva-DM that incorporates hypothesis generation. We contend that accounting for the dependence of judgments on memory provides a unifying theoretical framework for these various phenomena, as well as cognitive models that accurately reflect real-world behavior.

Key Words: overconfidence, probability judgment, memory, Minerva-DM, HyGene

One of the most widely studied topics within the judgment and decision-making (JDM) literature concerns estimates of confidence or likelihood. Traditionally, work on overconfidence within JDM has focused on the relationship between subjective estimates of confidence and performance, as indexed by proportion correct on a forced-choice judgment task. However, in recent years work on overconfidence has broadened to include effects such as overestimation biases more generally (Moore & Healy, 2008), as well as overestimation biases in probability judgment (Tversky & Koehler, 1994). In this chapter, we review various sources of bias in confidence judgments that have been revealed in the JDM literature, with the goal of detailing the cognitive underpinnings of judgments of confidence and likelihood.

Background and Historical Viewpoints

Perhaps the most well-known finding in the confidence judgment literature is that of overconfidence, in which people overestimate their own performance, as indicated by a discrepancy between performance accuracy and confidence. In the typical experiment, a participant answers a series of general knowledge questions such as, "Which building is taller? (a) Burj Khalifa (b) Empire State Building," and then states her confidence in her answers. Three outcomes are possible from this paradigm: good calibration, overconfidence, and underconfidence. A participant is said to be well calibrated to the extent that the mean confidence rating is equivalent to the proportion correct across all items. Overconfidence is observed when mean confidence is greater than proportion correct, and underconfidence when mean confidence is less than proportion correct. Figure 7.1 illustrates three hypothetical calibration curves for a two-alternative forced-choice task. In these kinds of tasks, participants are presented with a question or statement, must choose between one of two proposed answers, and then must provide a confidence judgment for their chosen response.

[Figure 7.1. Idealized graph depicting three possible outcomes of a two-alternative forced-choice calibration study. Proportion correct (.5 to 1.0) is plotted against subjective probability (.5 to 1.0), with curves for underconfidence, overconfidence, and perfect calibration.]
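The calibration measures just described reduce to simple aggregates over pairs of confidence ratings and accuracy scores. The following minimal Python sketch shows one way to compute them for a two-alternative forced-choice task; the data and the .1-wide binning scheme are illustrative assumptions, not taken from any published study.

```python
import numpy as np

# Illustrative data: a confidence rating (.5 to 1.0) and an accuracy flag per item
confidence = np.array([.95, .80, .80, .60, .70, .90, .55, .85])
correct    = np.array([1,   1,   0,   1,   0,   1,   1,   0  ])

# Overconfidence score: positive = overconfident, negative = underconfident
overconfidence = confidence.mean() - correct.mean()

# A calibration curve compares each confidence bin with its proportion correct
for lo in (.5, .6, .7, .8, .9):
    in_bin = (confidence >= lo) & (confidence < lo + .1)
    if in_bin.any():
        print(f"{lo:.1f}: stated {confidence[in_bin].mean():.2f}, "
              f"correct {correct[in_bin].mean():.2f}")
```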

For example, a participant could be presented with the statement "The Empire State Building is the tallest building in the world," with the possible responses "True" and "False." After the participant selected a response, she would then assign a probability between .5 and 1 as her confidence that she chose the correct answer. The perfect-calibration bisector line represents the situation in which participants' confidence judgments are equal to their proportion of correct choices. The underconfidence curve indicates that confidence underestimates proportion correct (e.g., when participants state a confidence of 80%, they are correct approximately 95% of the time), whereas the overconfidence curve indicates that confidence overestimates proportion correct (e.g., when participants state a confidence of 80%, they are correct only approximately 58% of the time). Research on calibration dates back at least to the 1970s and the foundational papers by Lichtenstein and colleagues (Fischhoff, Slovic, & Lichtenstein, 1977; Lichtenstein & Fischhoff, 1977; Lichtenstein, Fischhoff, & Phillips, 1977), which combined have been cited over 3,800 times (Google Scholar, October 30, 2013). A search of PsycInfo using the search term "overconfidence" produced 673 hits, with 58 papers published between January 1 and December 31, 2012. Articles on overconfidence have appeared in journals across multiple disciplines, including economics, psychology, meteorology, medicine, and business, among many others. Clearly this topic has been of widespread interest within academics.
Outside of academics, the issue of how people express confidence, and its correspondence to objective standards of accuracy, has also been of interest. For example, how accurately people can forecast the probability of future events is of great interest to the defense intelligence community (IARPA, 2014). Though the finding of overconfidence is far from universal, it is by far the most commonly reported finding in the literature (Moore & Healy, 2008; for a more general discussion of calibration and overconfidence, see Higham, Zawadzka, & Hanczakowski in this volume). However, despite the many decades of work on overconfidence, theoretical progress in understanding why deviations from perfect calibration occur has been slow. As far back as 1982, Lichtenstein, Fischhoff, and Phillips noted that "a striking aspect of much of the literature … is its 'dust-bowl empiricism.' Psychological theory is often absent either as motivation for the research or as explanations of the results" (p. 333). While there has been a fair amount of theoretical work in recent years, the majority of these theories have been pitched at different levels of analysis, and little effort has been devoted to developing a coherent theoretical framework that can capture the various experimental phenomena. Table 7.1 provides a partial listing of empirical phenomena related to confidence and/or probability judgments studied within the JDM literature. As can be seen, the number of behavioral regularities related to confidence has grown considerably over the years. Unfortunately, so has the number of theoretical explanations for these phenomena.


Table 7.1  Commonly observed effects related to confidence and probability judgment.

Phenomenon | Description | Key References
Conservatism | Probability judgments are less extreme than what is anticipated by Bayes' theorem. | (Phillips & Edwards, 1966)
Overconfidence | Subjective confidence judgments are excessive with respect to proportion correct. | (Lichtenstein & Fischhoff, 1977)
Hard/easy effects | People are overconfident for difficult items but underconfident for easy items. | (Lichtenstein & Fischhoff, 1977)
Expertise effects | Overconfidence decreases as a function of task-relevant experience. | (Dougherty, 2001; Keren, 1987)
Encoding effects | Overconfidence decreases as a function of how well information is encoded in LTM. | (Dougherty, 2001)
Sample size effect | Overconfidence is more extreme when judgments are based on small samples. | (Hansson, Juslin, & Winman, 2008)
Format dependence | The magnitude of overconfidence varies depending on whether participants provide point estimates or confidence intervals. | (Juslin, Wennerholm, & Olsson, 1999)
Subadditivity | The subjective probability of an explicit disjunction (i.e., a set of mutually exclusive events) exceeds the perceived probability of the implicit disjunction. | (Tversky & Koehler, 1994)
Alternative outcomes effect | The degree of subadditivity is affected by the objective probability distribution of the to-be-judged events. | (Dougherty & Hunter, 2003a; Windschitl & Wells, 1998)
Time pressure effects | Subadditivity increases when participants are placed under time pressure. | (Dougherty & Hunter, 2003a)
Working memory effects | Subadditivity decreases as a function of increases in working memory capacity. | (Dougherty & Hunter, 2003a)
Encoding effects | Subadditivity decreases as a function of how well information is encoded in LTM. | (Sprenger et al., 2011)
Irrelevant information effects | Subadditivity is affected by interference from irrelevant information. | (Dougherty & Sprenger, 2006)

For example, the overconfidence effect, which refers to the case where mean confidence > proportion correct, has been described as a statistical artifact resulting from random error (Budescu, Erev, & Wallsten, 1997; Erev, Wallsten, & Budescu, 1994), biased sampling of items (Gigerenzer, Hoffrage, & Kleinbölting, 1991; Juslin, 1994), the reliance on small samples (Juslin, Winman, & Hansson, 2007), and error-prone memory encoding and retrieval (Dougherty, 2001; Juslin & Persson, 2002). The question addressed here is: Can these phenomena be brought under a single common theoretical framework, and what does this theoretical framework tell us about sources of bias?


A Memory-Processes Account of Confidence and Probability Judgment

Dougherty et al. (1999) proposed a multiple-trace memory model to account for a variety of decision phenomena, ranging from base-rate neglect and availability biases to overconfidence and underconfidence. This model, Minerva-DM (DM = Decision Making), is an extension of Hintzman's (1984) Minerva-2 memory model, which has been used to understand a variety of memory-related phenomena such as recognition memory, cued recall, and frequency judgment. Like Minerva-2, Minerva-DM assumes a vector-based representation wherein each experienced event is represented in memory
by a vector of features. Retrieval from memory involves matching a probe vector to all traces in memory, computing the similarity of each trace to the probe, and then summing the cube of the similarities across all traces to yield the strength of the memory response (referred to as the conditional echo intensity in Minerva-DM) given the probe vector. Within Minerva-DM, these memory strengths are then used as the basis of both choice and confidence: The model chooses the alternative that yields the highest conditional echo intensity, and judgments of probability are given by the memory strength for each alternative relative to its alternatives. For example, if the task involved assessing which is more populous, Bonn or Munich, it is assumed that probe vectors corresponding to each city name are matched against traces in memory to yield an overall memory strength for each alternative. Whichever city name yields the highest memory strength is chosen as the most populous, and confidence is given by I(M)/[I(M) + I(B)], where I(M) and I(B) are the echo intensities (i.e., memory strengths) derived from probing memory with Munich and Bonn, respectively. Assuming that the memory strengths are valid indicators of the criterion variable of interest (Goldstein & Gigerenzer, 2002), one can exploit this knowledge for inferential tasks. Dougherty (2001) and Dougherty et al. (1999) showed that Minerva-DM could account for a variety of effects within the confidence judgment literature merely as a function of memory, including the general overconfidence phenomenon, decreases in overconfidence with expertise (Lichtenstein & Fischhoff, 1980), underconfidence (Lichtenstein, Fischhoff, & Phillips, 1977; Griffin & Tversky, 1992), conservatism (DuCharme, 1970; Phillips & Edwards, 1966), and the effect of encoding quality and experience on decreasing overconfidence (Dougherty, 2001). Further, Dougherty (2001) showed how Minerva-DM could accommodate two popular alternative accounts of overconfidence: random error models and ecological models. The insight gained from using Minerva-DM to model these findings is that many of them can be conceptualized as arising from an error-prone memory retrieval process governed by a small number of parameters (for more details, see Dougherty et al., 1999; Dougherty, 2001).
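To make the retrieval computation concrete, here is a minimal sketch of the Minerva-style echo-intensity calculation and the resulting confidence ratio. The -1/0/+1 feature coding follows the Minerva-2 convention, but the toy vectors are invented for illustration, and the model's encoding and noise parameters (see Dougherty et al., 1999) are omitted.

```python
import numpy as np

def echo_intensity(probe, traces):
    """Global memory match: each trace's similarity to the probe is a dot
    product normalized by the number of features nonzero in either vector,
    cubed (amplifying strong matches while preserving sign), and summed."""
    relevant = (probe != 0) | (traces != 0)
    sims = (traces * probe).sum(axis=1) / relevant.sum(axis=1)
    return (sims ** 3).sum()

# Toy episodic memory: five traces over six features coded -1/0/+1
memory = np.array([[ 1, -1,  1, 0,  1, -1],
                   [ 1, -1,  1, 1,  0, -1],
                   [-1,  1,  0, 1, -1,  1],
                   [ 1,  0,  1, 0,  1, -1],
                   [-1,  1, -1, 1,  0,  1]])
munich = np.array([ 1, -1,  1, 0,  1, -1])   # hypothetical probe vectors
bonn   = np.array([-1,  1,  0, 1, -1,  1])

i_m, i_b = echo_intensity(munich, memory), echo_intensity(bonn, memory)
choice = "Munich" if i_m > i_b else "Bonn"
confidence = max(i_m, i_b) / (i_m + i_b)     # I(M)/[I(M) + I(B)] for the winner
```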
More recently, Thomas, Dougherty, Sprenger, and Harbison (2008) extended Minerva-DM into the domain of hypothesis generation. This new model, HyGene, is able to account for a new set of empirical phenomena, and because HyGene is a superset of Minerva-DM, it can account for all of the aforementioned phenomena as well. The fundamental goal of the HyGene model is to provide a comprehensive account of how people generate hypotheses, how those hypotheses feed into processes necessary for evaluating confidence or probability, and how the hypotheses influence information search in the context of hypothesis testing. To start, we use the term hypothesis in a general sense to refer to any mental event about which we wish to make a decision. From this perspective, hypotheses could be the two alternatives in a two-alternative forced-choice task (which are provided by the experimenter), a set of explanations generated from memory, or a set of options from which one must choose. As a canonical example, consider a physician whose task is to decide upon a diagnostic hypothesis and render a treatment decision. We assume that this process starts with the observation of a small set of cues (e.g., symptoms and/or biographical data such as age, gender, or family history). This initial set of observed data (Dobs) is used to cue long-term memory (LTM), which ultimately leads to the generation of a set of plausible diagnostic hypotheses. The number of diagnostic hypotheses generated from LTM will depend on a number of factors, including the quality of the initial encoding of events into LTM, the disease base rates, the similarity of the presenting symptoms to examples stored in memory, time pressure, and working memory (WM) capacity. This set of plausible hypotheses is maintained in WM and comprises the Set of Leading Contenders (SOC): the subset of possible hypotheses from LTM that are most likely given the data. Once hypotheses are generated into the SOC, they can then be fed into algorithms for assessing relative probability (Thomas et al., 2008) or used to frame external information search (Dougherty, Thomas, & Lange, 2010). Figure 7.2 provides a diagram of the processes assumed to take place within the HyGene model. The computational details of the model are available elsewhere (Dougherty et al., 2010; Thomas et al., 2008), so we do not present them here. The processes within HyGene are initiated when the decision maker observes a set of cues (data) in the environment. These cues activate traces in episodic memory (Step 1), which then lead to the extraction of an unspecified probe vector from episodic memory (Step 2). This unspecified probe vector is matched against hypotheses stored in semantic memory (Step 3), and hypotheses are probabilistically sampled on the basis of their activation.

[Figure 7.2. Schematic of the HyGene model: environmental data (Dobs-1 … Dobs-N) → Step 1: Dobs-i activates traces in episodic memory → Step 2: extract an unspecified probe from episodic memory → Step 3: match the unspecified probe against known hypotheses in semantic memory → Step 4: if the sampled hypothesis's activation As exceeds ActMinH, add it to the Set of Leading Contenders; otherwise increment K; repeat until K = KMAX → Steps 5 and 6: probability judgment via conditional memory match, or implement an information search rule. Dobs = data observed (cue from the environment, which prompts retrieval); K = number of retrieval failures; As = activation of semantic hypothesis; ActMinH = the parameter governing how active a sampled hypothesis must be for it to be placed into the SOC; SOC (Set of Leading Contenders) = HyGene's WM construct governing the number of hypotheses that can be simultaneously maintained; KMAX = the number of retrieval failures the model is willing to entertain.]

Hypotheses are recovered from semantic memory and added to the SOC (Step 4) if their activation exceeds a threshold criterion (Ac); otherwise, the retrieval attempt is deemed a retrieval failure and K (a count of the total number of retrieval failures) is incremented. This sample-and-recovery process continues until KMAX retrieval failures have occurred, where KMAX corresponds to the total number of retrieval failures tolerated before generation is terminated (Harbison, Dougherty, Davelaar, & Fayyad, 2009). Once this threshold is met, the hypotheses maintained in the SOC are fed into either a process for determining the relative probability of the hypotheses under consideration or a process for information search. Note that in decision tasks in which the researcher provides the to-be-judged alternatives (e.g., the two-alternative forced-choice task), the generation process can be circumvented, and confidence or probability can be assessed directly through a memory-matching process.
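The generation loop in Steps 3 and 4 can be sketched in a few lines. Everything below is schematic: the activation values, the threshold (ActMinH), KMAX, and the WM capacity limit are illustrative stand-ins for HyGene's fitted parameters (see Thomas et al., 2008), and resampling an already-recovered hypothesis is simply counted as a failure in this sketch.

```python
import numpy as np

def generate_soc(activations, act_min_h=0.2, k_max=4, wm_capacity=4, seed=0):
    """Sample hypotheses from semantic memory in proportion to activation;
    recover one into the SOC if its activation exceeds ActMinH, otherwise
    count a retrieval failure; stop after KMAX failures or a full SOC."""
    rng = np.random.default_rng(seed)
    p = activations / activations.sum()
    soc, k = [], 0
    while k < k_max and len(soc) < wm_capacity:
        h = rng.choice(len(activations), p=p)
        if activations[h] > act_min_h and h not in soc:
            soc.append(h)        # Step 4: added to the Set of Leading Contenders
        else:
            k += 1               # retrieval failure: K = K + 1
    return soc

print(generate_soc(np.array([.9, .6, .25, .1, .05])))  # weak hypotheses rarely enter
```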
The algorithm for assessing the relative probability of hypotheses in the SOC is based on Tversky and Koehler's (1994) support theory framework. Tversky and Koehler assumed that the probability of any given hypothesis (A) is judged relative to a set of alternatives (B) by comparing the relative strengths of the evidence for them, P(A,B) = S(A)/[S(A) + S(B)], where S corresponds to the evidential support (strength of the evidence), A is the focal hypothesis, and B is a set of mutually exclusive alternative hypotheses. Tversky and Koehler (1994) did not provide a process model to describe how the strength of evidence was evaluated; however, this is easily accomplished within HyGene. For our purposes, we assume that strength of evidence is derived from the relative memory activations. Within HyGene, memory activations correspond to a global familiarity signal that is derived from a conditional memory-matching process. The familiarity signals of only those hypotheses that have been generated into the SOC are normalized to sum to 1.0. This assumption is critical both for normative reasons and for describing behaviors that deviate from normative models. Normatively, this assumption allows us to assume that people generally conform to the normative principle of additivity. However, note that the assumption of additivity is constrained to operate over only the subset of hypotheses that are maintained in the SOC. This implies that the probability judgments that people provide for a set of hypotheses should sum to 1.0 only if the hypotheses considered by the decision maker entail the entire objective hypothesis space (i.e., all possible hypotheses). In cases where the number of hypotheses in the objective hypothesis space exceeds the number of hypotheses that can be maintained in the SOC, participants should show subadditivity, where the sum of the probability judgments exceeds the probability of the inclusive set. To illustrate, imagine that you've just learned that a friend has been diagnosed with cancer, though you have not learned what form of cancer. What is the probability that your friend has lung cancer? Write down that number. Now, what is the probability that it is pancreatic cancer? Write down that number. How about breast cancer? Prostate cancer? Leukemia? Liver cancer? Hodgkin's disease? If you're like most people, the values you write down for each of these forms of cancer will quickly sum to more than 1.0, yet normatively the sum of the individual forms of cancer can be no greater than the probability of cancer, which in this case is 1.0, since you've already been told that your friend has cancer. The finding that the sum of the explicitly considered hypotheses (forms of cancer) exceeds the implicit disjunction (cancer) is referred to as
subadditivity, since the implicit disjunction is subadditive with respect to the sum of the explicitly considered alternatives.1 Note that with the manifestation of subadditivity, the judged probability of any given hypothesis within the objective hypothesis space will be excessive (i.e., a manifestation of overconfidence). HyGene naturally accounts for the finding of subadditivity as the result of impoverished hypothesis generation. Thus, the primary source of overconfidence (as defined by subadditivity) in HyGene is a failure to consider all possible objective hypotheses. However, impoverished hypothesis generation is but one of many sources of bias in probability and confidence judgments anticipated by HyGene and Minerva-DM. In the subsequent sections of this chapter, we elaborate on these various sources of bias.
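The arithmetic behind this account is easy to demonstrate. In the sketch below, each judgment normalizes the focal hypothesis's support over only the hypotheses that fit in the SOC; the support values and the three-item capacity are purely illustrative.

```python
import numpy as np

# Illustrative support (memory strength) for 8 mutually exclusive hypotheses
support = np.array([.30, .20, .15, .12, .10, .06, .04, .03])

def judged_probability(focal, support, soc_capacity=3):
    """Support-theory ratio S(A)/[S(A) + S(B)], with the comparison set B
    restricted to the strongest alternatives that fit in the SOC."""
    rivals = [h for h in np.argsort(support)[::-1] if h != focal]
    soc = [focal] + rivals[:soc_capacity - 1]
    return support[focal] / support[soc].sum()

total = sum(judged_probability(h, support) for h in range(len(support)))
print(total)  # about 1.6 rather than 1.0: subadditivity from impoverished generation
```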

Sources of Bias

The phenomena detailed in Table 7.1 have traditionally been thought of independently, with different models used to account for the various effects. Here, we show that many of these effects, as well as others outlined in Dougherty et al. (1999) and Thomas et al. (2008), can be accommodated within the HyGene framework.

Conservatism

Conservatism refers to the tendency for probability judgments to be less extreme than the objective values determined by Bayes's theorem. The classical example of conservatism involves judging the probability that a person is male or female, conditioned on height. In this case, height is the observed data, and the hypotheses are male versus female. Traditional accounts of conservatism assume that the effect arises from misperception or misaggregation. DuCharme (1970), however, suggested that conservatism arises due to response bias, and used a gender-height judgment task to illustrate the phenomenon (see the description in Dougherty et al., 1999). The response-bias account of DuCharme is conceptually similar to that proposed within the HyGene/Minerva-DM model, which postulates an activation threshold (Ac in HyGene) by which traces are deemed relevant to the decision task. To illustrate, we simulated the gender-height judgment task within HyGene/Minerva-DM by assuming that participants store exemplars in long-term memory (LTM) that correspond to males and females of different heights. We assume that the distributions of heights for males and females are roughly normally distributed, with means of 67 in. and 63 in., respectively, and a variance of 2.64 in. for both distributions.
Table 7.2  Frequency of memory traces stored in HyGene/Minerva-DM's memory corresponding to specific heights for males and females.

Height (in.) | 71.2 | 71 | 69.5 | 68.5 | 65.8 | 64.3 | 62.2 | 58.3 | 57.8
Male | 1,127 | 1,276 | 2,565 | 3,410 | 3,605 | 2,371 | 775 | 18 | 10
Female | 18 | 44 | 194 | 459 | 2,275 | 3,538 | 3,814 | 818 | 584

To represent males and females of various heights in memory, 26,901 traces were stored in proportion to the population distributions but restricted to discrete heights. Table 7.2 provides the specific number of traces corresponding to each gender/height combination for the 9 discrete heights. For the simulation, episodic memory was probed with a vector corresponding to a particular height. From this, the subset of traces in LTM that "matched" the height vector was activated and formed the judgment set. The relative probability of male versus female was then assessed by comparing the relative echo intensities for males versus females within the activated subset. Within each simulation run, this process was repeated for each of the 9 heights, for a total of 1,000 simulation runs. The simulation results, along with relevant judgment data drawn from DuCharme (1970), are plotted in Figure 7.3, where the odds from both the simulations and the subjects' judgments are plotted on the Y-axis, while the true Bayesian
odds are plotted on the X-axis. As should be clear, the model produces the traditional S-shaped curve that is often observed in real-world judgment tasks. Minerva-DM, however, fits these data without assuming any particular bias in the judgment process per se. Rather, the S-shaped curve is the result of a leaky (i.e., error-prone) memory process. Specifically, within the model, conservatism arises from the inability of the model to perfectly delineate the proper (objective) subset of traces in episodic memory. To the extent that the model is unable to delineate the proper set, it will show regressive, and hence conservative, behavior, with the crossover of the identity line occurring at the population base rates. For the height data, we assumed roughly equal base rates of males to females. However, as a more general point, this crossover will correspond to the modeled base rates. Thus, the crossover point should be pushed downward, for example toward 0.25, if the ratio of males to females were closer to 3:1. Doing so produces a probability curve that resembles the well-known probability weighting function assumed in Kahneman and Tversky's (1979) prospect theory, in which rare events are perceived to be more frequent than they really are.

[Figure 7.3. Simulation using the Minerva-DM (MDM) component of HyGene to model data presented in DuCharme (1970). The data represent probability judgments of p(gender|height) for both males and females, expressed as log estimated odds (Y-axis), plotted against the true probabilities expressed as log Bayesian odds (X-axis).]
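A stylized version of this simulation is sketched below, using the trace counts from Table 7.2. The proximity-based leakage rule is a deliberately simplified stand-in for Minerva-DM's similarity-based activation (governed by Sc/Ac in the actual model); it is meant only to show how imperfect delineation of the relevant subset regresses judgments toward the base rate.

```python
import numpy as np

heights = np.array([71.2, 71, 69.5, 68.5, 65.8, 64.3, 62.2, 58.3, 57.8])
male    = np.array([1127, 1276, 2565, 3410, 3605, 2371, 775, 18, 10])
female  = np.array([18, 44, 194, 459, 2275, 3538, 3814, 818, 584])

def judged_vs_bayesian(i, leak=0.3):
    """Probe memory with height i. Traces at other heights leak into the
    activated subset in proportion to proximity, so the judged probability
    is regressive relative to the Bayesian value."""
    w = np.where(np.arange(len(heights)) == i,
                 1.0, leak / (1 + np.abs(heights - heights[i])))
    judged   = (w * male).sum() / (w * (male + female)).sum()
    bayesian = male[i] / (male[i] + female[i])
    return judged, bayesian

# At extreme heights the Bayesian p(male) is near 1 or 0, but the judged
# value is pulled toward the overall base rate (~.56): the S-shaped curve
print(judged_vs_bayesian(0), judged_vs_bayesian(8))
```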
What are the sources of conservatism within Minerva-DM/HyGene? There are two mechanisms that can drive conservatism within the model. One mechanism is a parameter that governs how well the model discriminates subset-relevant from subset-irrelevant traces in memory. For those familiar with HyGene, this is the Ac parameter (which is the cube of the Sc parameter in Minerva-DM). Ac is assumed to be a discrimination threshold in the model and is a free parameter. Ac is conceptually similar to a bias parameter, but here the bias governed by Ac concerns which traces are considered relevant to the decision task. The other mechanism is similarity. To the degree that the height features (i.e., Dobs) of the people stored in memory are highly similar to one another, the model will be less able to delineate the proper subset of traces in episodic memory. Thus, by merely increasing or decreasing the similarity of the events stored in memory, HyGene anticipates that conservatism will show a corresponding increase or decrease.

Overconfidence

Overconfidence is the antithesis of the conservatism effect: Rather than judgments being less extreme than the objective standard, the overconfidence effect corresponds to situations in which confidence or probability judgments are more extreme than the objective standard. As outlined at the beginning of this chapter, overconfidence is frequently studied via a two-alternative forced-choice task, and typically this task involves either a general knowledge test of sorts (Fischhoff et al., 1977; Lichtenstein & Fischhoff, 1977; Lichtenstein et al., 1977) or a categorization task (e.g., choosing which of two diseases is most likely, given a symptom; Dougherty, 2001). The primary difference between tasks used to study conservatism and tasks used to study overconfidence lies in the objective standard. In studies of conservatism, confidence judgments are compared to Bayesian probabilities. In studies of overconfidence, confidence judgments are typically compared to proportion correct. The overconfidence effect manifests when the mean confidence over a set of questions exceeds the corresponding accuracy, as defined by proportion correct. As robust and well known as this effect is, it seems to manifest in only a limited number of circumstances. For example, people tend to show overconfidence when the judgment task contains hard items, but show the opposite
pattern (underconfidence: mean confidence < proportion correct) when the items are easy. This has been dubbed the hard/easy effect (Gigerenzer et al., 1991; Juslin, 1994; Lichtenstein et al., 1977). Other effects observed within the literature include decreases in overconfidence with increases in experience or expertise, and increases in overconfidence under conditions that degrade how well information is represented or encoded in memory (Dougherty, 2001). Figure 7.4 plots simulation results using Minerva-DM/HyGene (bottom two panels) alongside data from Dougherty (2001) (top two panels). In Dougherty (2001), participants performed a categorization task in which they learned to diagnose patients as having one of two diseases, given one of 8 symptoms. In Experiment 1, the primary manipulation was the number of learning trials (80 versus 240), whereas in Experiment 2 encoding quality was manipulated. As can be seen, Minerva-DM/HyGene anticipates the change in calibration observed both as a function of experience and of encoding: Increasing the number of traces in memory from 80 to 240 and increasing encoding from L = 0.40 to L = 0.60 leads to a decrease in overconfidence and improved calibration. The effects within Minerva-DM/HyGene manifest due to the fact that both variables decrease the random noise attributed to memory retrieval. As the noise in memory retrieval decreases, so, too, does overconfidence. The explanation put forth by Minerva-DM/HyGene is consistent with the proposal of Erev et al. (1994) that overconfidence arises from variability in the judgment process, but it goes further by suggesting that variability in judgment arises from memory retrieval processes (Dougherty, 2001; Pleskac, Dougherty, Rivadeneira, & Wallsten, 2009). Note that it is possible for a judge to be both overconfident with respect to the proportion correct and conservative with respect to the objective probabilities. This simultaneous overconfidence/underconfidence has been observed in a number of studies (Erev et al., 1994) and can also be reproduced within the Minerva-DM component of HyGene. Erev et al. (1994) argued that it resulted from random error and the corresponding regression effects when judgments are aggregated. While this account is consistent with HyGene's account of overconfidence, it is not consistent with how HyGene produces conservatism. Further, the construct of random error within HyGene is specific to error that arises from memory. In fact, there is no specific error term added to the judgment process per se within HyGene; all of the error is a product of noisy memory retrieval.

[Figure 7.4. Effect of encoding and experience on the calibration of probability judgments predicted by HyGene. Top panels: data from Dougherty (2001), Experiment 1 (80 vs. 240 trials) and Experiment 2 (poor vs. good encoding); bottom panels: HyGene simulations (80 vs. 240 traces; L = .4 vs. L = .6). Each panel plots P(true | pc) against mean probability judgment. Data panels are reprinted from Dougherty (2001), Journal of Experimental Psychology: General.]
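The random-noise account lends itself to a brief simulation. The evidence values and noise levels below are arbitrary illustrative choices; the point is only that symmetric retrieval noise added to the memory strengths inflates the confidence ratio without a compensating gain in accuracy.

```python
import numpy as np

def simulate_calibration(noise, n=50_000, seed=1):
    """Two-alternative choice in which the correct alternative has slightly
    stronger true evidence (.6 vs .5); retrieval noise perturbs both
    strengths, and confidence is the relative strength of the winner."""
    rng = np.random.default_rng(seed)
    s_correct = np.abs(.6 + noise * rng.standard_normal(n))
    s_foil    = np.abs(.5 + noise * rng.standard_normal(n))
    confidence = np.maximum(s_correct, s_foil) / (s_correct + s_foil)
    accuracy   = (s_correct > s_foil).mean()
    return confidence.mean(), accuracy

print(simulate_calibration(noise=0.05))  # mean confidence below accuracy
print(simulate_calibration(noise=0.50))  # mean confidence exceeds accuracy
```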

Subadditivity

Subadditivity refers to the empirical phenomenon wherein the probability assigned to an implicit disjunction is less than (i.e., subadditive with respect to) the sum of the probabilities assigned to the explicit disjunction. This finding is quite robust and has been found in numerous experiments (Dougherty & Hunter, 2003a, 2003b; Dougherty & Sprenger, 2006; Sprenger & Dougherty, 2006, 2012; Sprenger et al., 2011). As a prototypical example, participants might be told to imagine that an individual was randomly sampled from the population and it was determined that the person suffers from some form of cancer. They are then asked to rate the probability that the person suffers from various forms of cancer, such as lung cancer, pancreatic cancer, liver cancer, prostate cancer, and, finally, all other forms of cancer. Normatively, the sum of these latter probabilities should equal 1.0. Behaviorally, however, people generally produce excessive probability judgments, wherein the sum of the unpacked hypotheses (the various forms of cancer) is greater than the probability assigned to the packed hypothesis, "cancer" (which in this example is 1.0). But why does this effect so routinely obtain? Within HyGene, subadditivity arises from the failure to consider alternative hypotheses. Remember the example above in which you judged the probability that your friend suffered from various types of cancer. If you failed to generate or consider all possible forms of cancer, then the rating assigned to each individual form of cancer should have been overestimated (i.e., the ratings should have summed to greater than one). HyGene anticipates that the perceived likelihood of any particular event will depend both on the strength of evidence for
the judged event, and on the overall strength of evidence for the alternatives and the number of alternatives included in the comparison process. This very general and simple explanation implies that the magnitude by which participants show subadditivity should be affected by variables that impede or reduce the number of alternatives included in the comparison process. We consider three sources of subadditivity bias next.

Working memory as a source of subadditivity.

A central component of HyGene concerns the role of WM in maintaining the set of leading contender hypotheses. The SOC comprises those hypotheses that have been generated from memory and queued for use in the comparison process to derive a probability judgment. Because WM is assumed to be capacity limited, the number of hypotheses that can be explicitly considered by the decision maker is typically less than the total number of potential hypotheses contained within the objective hypothesis space. This implies that the limited capacity of WM is a major factor in driving the subadditivity effect, and that the magnitude of the subadditivity effect should covary with individual differences in WM capacity. The anticipated relationship between subadditivity and individual differences in WM was confirmed by Dougherty and Hunter (2003a; see also Dougherty & Hunter, 2003b; Sprenger et al., 2011). In their experiment, participants learned the relative frequencies of a set of 8 mutually exclusive and exhaustive events drawn from each of 4 categories, with the distributions (relative frequencies) manipulated such that for some categories the distribution consisted of a single highly frequent event (single-peaked distribution), with the other events occurring rarely, whereas for other categories the events were of approximately equivalent probability (uniform distribution). After studying each distribution, participants judged the probability of each of the 32 events (8 events x 4 categories). Figure 7.5 shows the sum of the probability judgments for each of the 4 distributions, along with the corresponding predictions derived from HyGene. Note that for all four distributions the judgment sums exceed 100%, indicating that participants' judgments were subadditive. Note also that the magnitude of the effect covaried with distribution. Importantly, Dougherty and Hunter (2003a) also measured individual differences in WM capacity and found a reliable correlation with judgment magnitude: Participants with higher WM capacity showed less subadditivity (r = –0.37). These results were anticipated by HyGene. The covariation with distribution is a natural byproduct of the HyGene model, whereas the covariation between judgment and WM capacity can be accounted for by varying a parameter that specifies how much information can be maintained in WM. While the account provided by HyGene for the relationship between WM and judgment is plausible, it is important to point out that studies focused on manipulating WM load while people make their judgments have produced only small effects on judgment magnitude (Sprenger et al., 2011). This implies that the relationship between WM capacity and judgment is more complex than merely a reduction in capacity, and it may depend on factors more closely related to retrieval from LTM.

[Figure 7.5. Mean (with standard errors) sums of probability judgments from Dougherty and Hunter (2003a) across the four distributions used in the experiment (20-42-2-2-2-2-2-2; 20-20-20-3-3-3-3-2; 20-15-15-15-3-2-2-2; 20-10-9-9-8-8-8-2), plotted alongside the corresponding HyGene predictions.]
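The toy normalization used earlier can be applied directly to the four frequency distributions from Dougherty and Hunter (2003a), as sketched below. This is not the fitted HyGene model (which adds encoding noise and probabilistic generation); it only illustrates how a smaller SOC capacity, standing in for a lower WM span, inflates the judgment sums, and how the magnitude varies with the distribution.

```python
import numpy as np

distributions = {
    "20-42-2-2-2-2-2-2":   [20, 42, 2, 2, 2, 2, 2, 2],
    "20-20-20-3-3-3-3-2":  [20, 20, 20, 3, 3, 3, 3, 2],
    "20-15-15-15-3-2-2-2": [20, 15, 15, 15, 3, 2, 2, 2],
    "20-10-9-9-8-8-8-2":   [20, 10, 9, 9, 8, 8, 8, 2],
}

def judgment_sum(freqs, soc_capacity):
    """Sum of judged probabilities when each judgment normalizes the focal
    event's frequency over itself plus the strongest SOC-resident rivals."""
    f = np.asarray(freqs, dtype=float)
    total = 0.0
    for focal in range(len(f)):
        rivals = [h for h in np.argsort(f)[::-1] if h != focal]
        soc = [focal] + rivals[:soc_capacity - 1]
        total += f[focal] / f[soc].sum()
    return 100 * total  # as a percentage, to match Figure 7.5

for name, freqs in distributions.items():
    # smaller capacity (lower WM span) yields larger, more subadditive sums
    print(name, round(judgment_sum(freqs, 3)), round(judgment_sum(freqs, 5)))
```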

Retrieval as a source of subadditivity.

Inasmuch as the input into the comparison process is dependent on retrieval from LTM, one would expect a dependence of judgment on retrieval: Retrieving more hypotheses should lead to an increase in the number of hypotheses included in the comparison process and a corresponding decrease in judged probability. While this prediction seems straightforward, it runs counter to the predictions of HyGene in one important case. Before describing why, we first review the relevant data. Sprenger et al. (2011) tested the impact of divided attention at encoding on subadditivity. In Experiment 3 of Sprenger et al., participants studied four distributions of hypotheses (with 8 hypotheses per distribution) under either full or divided attention. Two of the distributions consisted of 8 items that appeared with roughly equal frequency (the balanced distributions), whereas for the other two distributions, two of the items occurred with much higher frequency (the unbalanced distributions). After study, but before judgment, participants were asked to retrieve as many hypotheses as possible from two of the four distributions, after which participants judged the probability of each of the hypotheses from all four distributions. Prior work within the memory literature suggests that divided attention during study produces relatively large reductions in recall accuracy (Craik, Govoni, Naveh-Benjamin, & Anderson, 1996). Therefore, we reasoned that having participants learn the distributions under divided attention would lead to both a reduction in the number of hypotheses they could generate and a concomitant increase in subadditivity. This is exactly what we found, as shown in Figure 7.6. Further, as reported in Sprenger et al. (2011), the number of hypotheses that participants generated in the recall task was negatively correlated with judgment magnitude: Participants who generated more hypotheses showed less subadditivity. In general, HyGene anticipates the finding that increasing the number of hypotheses included in the comparison process should lead to a corresponding decrease in subadditivity. However, in the case in which retrievability is manipulated via encoding quality, HyGene makes the opposite prediction. HyGene predicts that divided attention during encoding should lead to a decrease in subadditivity, despite the fact that fewer hypotheses are included in the comparison. The reason for this prediction stems from the fact that the computational model uses memory strengths in computing the probability of each hypothesis in the SOC. These memory strengths increase exponentially as encoding increases, which can offset the increased number of hypotheses included in the comparison process. Thus, the effect obtains due to a list-strength effect, which is generally predicted by global matching models (Clark & Gronlund, 1996).

Memory monitoring as a source of subadditivity.

While retrieval is clearly important for defining the set of hypotheses that get fed into the comparison process to derive probability judgments, what happens when the decision maker generates irrelevant hypotheses? Within the memory literature, retrieval processes are often conceived as taking place within the context of a system that monitors for intrusions. Intrusions are items that are objectively incorrect, in that they were not part of the original study list. An example of an intrusion error would be incorrectly recalling "apple" when trying to recall a previously studied list of fruits from which "apple" was absent. Importantly, intrusion errors are a typical aspect of memory retrieval and tend to be higher in contexts that promote proactive interference, that is, when previously learned information intrudes on the recall of newly learned information. This occurs, for example, when recalling fruit names from previously learned lists of fruits while trying to recall only the most recent list of fruit names (Rosen & Engle, 1997). Within the decision-making literature, an irrelevant hypothesis is one that lies outside the space of potential hypotheses. For example, imagine a doctor who maintains the "sprained ankle" hypothesis for a patient who is suffering loss of feeling in the right arm. Because an ankle sprain is not a legitimate cause of loss of feeling in the arm, it is irrelevant to the judgment task and therefore should not impinge on determining the causal hypotheses for the loss of feeling. Inasmuch as hypothesis generation entails retrieval, it seems reasonable to assume that intrusions (i.e., irrelevant hypotheses) are likely to occur in the context of diagnostic reasoning tasks, just as they do in memory tasks. Thus, the question is not whether people generate irrelevant hypotheses, but rather how they treat those hypotheses once they have been generated.

[Figure 7.6. Number of hypotheses recalled for the high and low cognitive load conditions by distribution type (unbalanced vs. balanced; top panel) and mean subadditivity scores (sums of judgments; bottom panel), further broken down by recall condition. Graphs reprinted from Sprenger et al. (2011) by permission of the author.]

Dougherty and Sprenger (2006) investigated two mechanisms by which irrelevant hypotheses might impinge on probability judgment: discrimination failure and inhibition failure. The discrimination failure account presupposes that people treat irrelevant hypotheses, for example the "sprained ankle" hypothesis for a patient who is suffering loss of feeling in the right arm, as if they were relevant, simply because they cannot discriminate between relevant and irrelevant hypotheses. In contrast, within the inhibition failure account, it is assumed that people can distinguish between relevant and irrelevant hypotheses but cannot inhibit irrelevant hypotheses or prevent them from occupying WM resources.
Thus, irrelevant hypotheses would affect judgment in the same way that a divided attention task would, by reducing the amount of WM resources available for maintaining relevant hypotheses for input into the comparison process. As detailed above, conditions that reduce the number of hypotheses fed into the comparison process should generally lead to an increase in judged probability (and consequently subadditivity). In contrast, under the discrimination failure account, the probability assigned to any particular hypothesis should be dependent on the strengths of both relevant and irrelevant hypotheses. Using an adaptation of the proactive interference paradigm, Dougherty and Sprenger
(2006) showed that the judged probability of any given hypothesis decreased or increased as a function of the strength of the irrelevant hypotheses. Judged probability increased under conditions in which the irrelevant hypotheses were weak, but decreased under conditions in which the irrelevant hypotheses were strong. This pattern of findings is uniquely consistent with the discrimination failure account, since it is only under this account that the strength of irrelevant hypotheses can influence the judged probability. In contrast, the inhibition failure account predicts that the probability of any given hypothesis should decrease regardless of the strength of the irrelevant hypotheses if they are generated into WM. Thus, another source of bias in probability judgment arises due to memory monitoring errors (Dougherty & Franco-Watkins, 2003; Dougherty & Sprenger, 2006).
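The contrast between the two accounts can be captured in a toy calculation; the strength values here are invented for illustration. Under discrimination failure the strengths of irrelevant hypotheses enter the comparison directly, so strong irrelevant hypotheses pull the judgment down while weak ones allow it to rise; under inhibition failure they would merely displace relevant hypotheses from WM, lowering judgments regardless of their strength.

```python
def judged_prob_discrimination_failure(s_focal, s_relevant, s_irrelevant):
    """Irrelevant hypotheses are mistaken for relevant ones, so their
    strengths join the denominator of the support-theory ratio."""
    return s_focal / (s_focal + sum(s_relevant) + sum(s_irrelevant))

strong = judged_prob_discrimination_failure(.4, [.3, .2], [.5, .4])    # ~.22
weak   = judged_prob_discrimination_failure(.4, [.3, .2], [.05, .05])  # ~.40
# Judgments rise when irrelevant hypotheses are weak and fall when they
# are strong, the pattern Dougherty and Sprenger (2006) observed
```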

General Discussion

The overarching goal of this chapter is to delineate some of the sources of bias in confidence and probability judgment that are frequently observed in JDM tasks. The majority of the biases reviewed here can be traced quite directly to memory processes. These findings imply that judgments of probability and confidence, as they have traditionally been studied within the JDM literature, are interdependent with memory, a view that is explicitly accommodated by both Minerva-DM and HyGene. Despite the seemingly obvious dependence of judgment on memory suggested by the studies reviewed here and elsewhere, mainstream decision researchers have been slow to recognize this dependence. Instead, much of the work within the JDM literature has focused on the elicitation stage of judgment, as if to assume that expressions of confidence and probability are decision phenomena that exist independently of the inputs from memory. For example, within the JDM field there exists a relatively robust literature on methods of elicitation, wherein the focus tends toward understanding why people frequently produce overly narrow confidence intervals in an interval-elicitation task. While this is no doubt important and rigorous work, the view espoused here is that methods of elicitation will interact with memory processes. If true, then studying elicitation methods without explicitly modeling or acknowledging the dependence on memory may lead to a misleading view of judgment processes.

Memory as Primitive

The idea that memory processes serve as input to the judgment process suggests that memory is a primitive for JDM. While this has been recognized within the metacognitive literature (Dunlosky & Bjork, 2008), it has not been a dominant view within the JDM literature. While decision researchers tend to focus more on the decision phase, metacognitive researchers focus more on the relationship between memory and judgment (Dougherty, Scheck, Nelson, & Narens, 2005; Dunlosky & Bjork, 2008; Hertzog, Fulton, Sinclair, & Dunlosky, 2014; Matvey, Dunlosky, & Guttentag, 2001; Nelson & Dunlosky, 1991, 1992). This latter view explicitly acknowledges the dependence of judgment on memory (Dunlosky & Bjork, 2008), whereas the former relegates memory processes to a secondary role. Our position is that memory processes are primitive to JDM (Dougherty, Gronlund, & Gettys, 2003). According to the memory-as-a-primitive hypothesis, understanding JDM requires that one explicitly account for the dependence of judgment on memory. The failure to do so, we argue, can present a misleading picture of judgment processes (cf. Dunlosky & Bjork, 2008). Fortunately, there is a growing trend toward acknowledging the role of memory processes within JDM tasks, with a number of recent models explicitly proposed to account for memory processes. These include query theory (Johnson, Häubl, & Keinan, 2007; Weber et al., 2007), Probex (Juslin & Persson, 2002), the naïve sampling model (Juslin et al., 2007), decision by sampling (Stewart, Chater, & Brown, 2006), and Hilbert's information-theoretic approach (Hilbert, 2012), in addition to HyGene (Thomas et al., 2008) and Minerva-DM (Dougherty et al., 1999). These models share the assumption that memory is a primitive component of JDM, but they differ in the degree to which memory processes are actually modeled. Beyond these process models, there are also emerging areas of research within JDM that explicitly acknowledge the role of memory processes, many of which we are beginning to explore within the context of HyGene.

Decisions from experience.

One emerging area of research within the JDM literature concerns the use of past experience to guide decisions—so-called decisions from experience (Hertwig, Barron, Weber, & Erev, 2004; Stewart et al., 2006). In decisions from experience, participants are assumed to make their decisions by drawing on knowledge or past experiences. Decisions from experience contrast with decisions from givens, where all of the information necessary
for making a decision is presented to the decision maker, thereby alleviating any memory demands. One particularly influential model of decisions from experience is the naïve sampling model (NSM) proposed by Juslin and colleagues (2007). The NSM makes three key assumptions: (1) people retrieve a sample of items from LTM, (2) this sample is maintained in short-term memory, which is capacity limited, and (3) people uncritically assume that properties of these samples are representative of population parameters. These assumptions allow the NSM to account for the format-dependence phenomenon, wherein people tend to be overconfident in tasks in which they must estimate intervals from a distribution (for example, the smallest interval within which one is 90% certain that the unemployment rate will fall on date X), but show very little (or no) overconfidence when asked to estimate a probability for a given interval (for example, the probability that the unemployment rate will be between 6.5% and 7.5% on date X). According to the NSM, a judge samples a set of relevant observations from LTM into short-term memory and then assumes that properties of that sample are representative of the target population. Since probability estimation is equivalent to estimating a ratio, and since sample ratios are unbiased estimators of population ratios, no computational bias is introduced by probability production. Interval production, on the other hand, requires a judge to estimate dispersion. Sample dispersion is not only a biased estimator of population dispersion but is highly dependent on sampling extreme values in order to obtain a sample dispersion representative (i.e., similarly broad) of the population, with the result that sample dispersions tend to be much smaller than that of the population from which a sample is drawn. Since, according to the NSM, judges use both sample ratios and sample dispersions as if they were unbiased estimators of population parameters, interval production will tend to yield significant overconfidence while probability production will tend to yield unbiased estimates (Hansson, Juslin, & Winman, 2003). The NSM can be easily accommodated as a specific instantiation of HyGene. Rather than assume that people maintain a sample of explanatory hypotheses in WM, we assume that the hypotheses correspond to a sample of specific events retrieved from LTM. For example, if asked to produce the 25th and 75th percentiles of the heights of NBA basketball players, we assume that people generate a sample of heights from LTM using HyGene's
retrieval mechanisms and then use those heights to estimate the specified quantiles. The only change to the HyGene architecture involves augmenting the judgment module (Step 6 of Figure 7.2) to allow judgments to be based on the values of the events retrieved from memory, rather than estimated via a conditional memory search process.
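The asymmetry between the two response formats is easy to demonstrate. In the sketch below, the population, its parameters, and the sample size are invented for illustration; the judge simply treats a small retrieved sample as if it were the population, as the NSM assumes.

```python
import numpy as np

rng = np.random.default_rng(7)
population = rng.normal(78, 3.5, 100_000)  # hypothetical heights (in.) in LTM

def naive_judge(sample_size=6):
    """Retrieve a small sample and read judgments directly off of it."""
    sample = rng.choice(population, sample_size)
    p_estimate = (sample > 80).mean()              # probability format
    interval = np.quantile(sample, [.05, .95])     # 90% interval format
    return p_estimate, interval

judgments = [naive_judge() for _ in range(5_000)]
mean_p = np.mean([p for p, _ in judgments])        # close to the true p(> 80)
coverage = np.mean([((population > lo) & (population < hi)).mean()
                    for _, (lo, hi) in judgments]) # well below .90: overconfident
```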

Hypothesis-guided information search and the deployment of attention.

Hypotheses can be viewed as providing people with a sort of expectation. In the context of confidence and probability judgment, these expectations provide an evaluative mechanism for predicting the future. For example, if you are asked to guess who will win next year's NCAA basketball tournament, the first thing you might do is generate a short list of candidate teams, and on the basis of that list derive the associated probability of each candidate team winning. However, expectancies can also be exploited in a variety of other ways, including anticipating the presence of specific events in your environment. For instance, when searching for one's keys, one may generate possible locations where the keys may be based on previous experience (i.e., where the keys are normally left). Within HyGene, we view possible target characteristics (i.e., features and/or locations) retrieved from LTM on the basis of observed cues as hypotheses, and we assume that these hypotheses provide an attentional set on which search is based (Folk, Remington, & Johnston, 1992). In other words, the hypotheses generated from LTM and maintained in WM will guide attentional processes toward events in the world that most closely correspond to the hypotheses (Desimone & Duncan, 1995). The idea that the contents of WM might drive the deployment of attention is not entirely new. Indeed, a variety of studies now illustrate that there is an attentional bias toward items in the perceptual field that correspond with the contents of WM (Downing, 2000; Soto, Heinke, Humphreys, & Blanco, 2005; Soto & Humphreys, 2007; Woodman & Luck, 2007). However, in most of these experiments, participants are simply provided with an item to hold in WM prior to the visual search task. HyGene envisions that the contents of WM (the SOC) are the result of the hypothesis generation process. This simple model, in which hypotheses derived from LTM influence visual search, provides a powerful mechanistic account of top-down influences on visual search that can be readily applied to real-world contexts.
real world contexts. For example, HyGene anticipates that a doctor who generates one or more diagnostic hypotheses for a particular patient will use those hypotheses to frame the search for additional symptoms, which could confirm or differentiate the hypotheses. Although our work in this area is just beginning, early results examining the interactions between LTM, WM, and attention within the context of a cued recall visual search paradigm have been promising. We have found that the speed of visual search is critically dependent on the diagnosticity of the cue that precedes a search array. The diagnosticity of the cues is critical because it delimits the number of features that may be associated with the target. Thus, a target may only ever be associated with one or two target features given a highly diagnostic cue and could be associated with any feature for a nondiagnostic cue. When explicitly aware of the utility of the cues, more diagnostic cues lead to faster visual search times relative to less diagnostic cues (Buttaccio, Lange, Hahn, & Thomas, 2014). This suggests that participants retrieve and maintain potential target characteristics (hypotheses) and then deploy attention to corresponding features within the visual field. In other words, when given the opportunity to do so, participants engage in hypothesis generation processes to reduce the perceptual demands of visual search. Although the degree to which attention is guided has been questioned in related paradigms (contextual cueing; see Kunar, Flusbert, Horowitz, & Wolfe, 2007), within our paradigm an eye tracking study revealed that a target is more likely to be fixated first in a visual search array when the cue prompting retrieval is of high, as opposed to low diagnosticity, providing evidence of strong attentional guidance (Buttaccio, 2013). For our future empirical work, we plan on extending this general paradigm to tasks beyond that of visual search to examine how the visual environment is foraged in order to winnow down the hypothesis in WM, such as in a medical diagnosis task (Dougherty, Thomas, & Lange, 2010). We argue that visual search and hypothesis testing could be conceived as general cases of information search, and our current theoretical goal is to construct a model of hypothesis generation that can perform information search. Ultimately, we believe that this work will lead to the development of a semi-autonomous decision agent that can generate hypotheses based on incoming information, and then use those hypotheses to direct attention in its visual world.
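To convey the mechanism, here is a toy sketch in Python. It is entirely our own illustrative construction: the cue-to-feature store, the capacity parameter, and the serial-scan assumption are hypothetical and are not the HyGene implementation or the retrieval guidance paradigm itself. A diagnostic cue constrains the hypotheses held in WM, so hypothesis-consistent items are scanned first and the target is fixated earlier.

```python
import random

# Hypothetical cue-to-feature associations in LTM: a diagnostic cue retrieves few
# candidate target features (hypotheses); a nondiagnostic cue retrieves many.
LTM = {"diagnostic": ["red"],
       "nondiagnostic": ["red", "green", "blue", "yellow"]}

def fixations_to_target(cue, display, target="red", wm_capacity=4):
    """Scan hypothesis-consistent items first (the attentional set derived from LTM),
    then the rest; return the serial position at which the target is fixated."""
    hypotheses = LTM[cue][:wm_capacity]            # the candidate hypotheses maintained in WM
    scan_order = sorted(display, key=lambda item: item not in hypotheses)
    return scan_order.index(target) + 1

random.seed(1)
for cue in ("diagnostic", "nondiagnostic"):
    total = 0
    for _ in range(1000):
        display = ["green", "blue", "yellow", "red", "purple", "orange"]
        random.shuffle(display)
        total += fixations_to_target(cue, display)
    print(f"{cue} cue: {total / 1000:.2f} fixations to the target, on average")
```

With the diagnostic cue the target is always fixated first; with the nondiagnostic cue it is, on average, reached only after scanning other hypothesis-consistent items, mirroring the slower search times for less diagnostic cues.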

Summary

We have argued for a memory-theoretic account of overconfidence and have shown that such an account naturally accommodates the many explanations for the phenomenon, including the random error and ecological (sampling) explanations. Within our account, overconfidence occurs because of underlying memory constraints, which lead to systematic biases in beliefs. Embracing this memory view of overconfidence, we showed how the memory processes that underlie closely related biases (e.g., subadditivity) also contribute to overconfidence. Thus, the memory framework allows the concept of overconfidence to extend beyond the two-alternative forced-choice paradigm to more complex contexts in which the alternatives to be evaluated are not provided by a researcher but must be generated by the decision maker from memory, as in medical diagnosis and other real-world domains. Finally, because much of overconfidence derives from underlying memory constraints, elicitation methods and decision aids developed to mitigate overconfidence bias should be designed to compensate for the underlying memory limitations.

Note

1. This is also a violation of n-ary complementarity, according to which the judged probabilities should normatively sum to 1.0. In the case where N = 2 (binary complementarity), complementarity is frequently observed.

References

Budescu, D. V., Erev, I., & Wallsten, T. S. (1997). On the importance of random error in the study of probability judgment. Part I: New theoretical developments. Journal of Behavioral Decision Making, 10, 157–171.
Buttaccio, D. R. (2013). Probabilistic visual search (Unpublished doctoral dissertation). Norman, OK: University of Oklahoma.
Buttaccio, D. R., Lange, N. D., Hahn, S., & Thomas, R. P. (2014). Explicit awareness supports conditional visual search in the retrieval guidance paradigm. Acta Psychologica, 145, 44–53.
Clark, S. E., & Gronlund, S. D. (1996). Global matching models of recognition memory: How the models match the data. Psychonomic Bulletin & Review, 3, 37–60.
Craik, F. I., Govoni, R., Naveh-Benjamin, M., & Anderson, N. D. (1996). The effects of divided attention on encoding and retrieval processes in human memory. Journal of Experimental Psychology: General, 125(2), 159.
Desimone, R., & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18, 193–222.
Dougherty, M., Thomas, R., & Lange, N. (2010). Toward an integrative theory of hypothesis generation, probability judgment, and hypothesis testing. Psychology of Learning and Motivation, 52, 299–342.


Dougherty, M. R. (2001). Integration of the ecological and error models of overconfidence using a multiple-trace memory model. Journal of Experimental Psychology: General, 130, 579.
Dougherty, M. R., & Franco-Watkins, A. M. (2003). Reducing bias in frequency judgment by improving source monitoring. Acta Psychologica, 113, 23–44.
Dougherty, M. R., Gettys, C. F., & Ogden, E. E. (1999). MINERVA-DM: A memory processes model for judgments of likelihood. Psychological Review, 106, 180.
Dougherty, M. R. P., Gronlund, S. D., & Gettys, C. F. (2003). Memory as a fundamental heuristic for decision making. In S. L. Schneider & J. Shanteau (Eds.), Emerging perspectives on judgment and decision research (pp. 125–164). Cambridge, MA: Cambridge University Press.
Dougherty, M. R., & Hunter, J. (2003a). Probability judgment and subadditivity: The role of working memory capacity and constraining retrieval. Memory & Cognition, 31, 968–982.
Dougherty, M. R., & Hunter, J. E. (2003b). Hypothesis generation, probability judgment, and individual differences in working memory capacity. Acta Psychologica, 113, 263–282.
Dougherty, M. R., Scheck, P., Nelson, T. O., & Narens, L. (2005). Using the past to predict the future. Memory & Cognition, 33, 1096–1115.
Dougherty, M. R., & Sprenger, A. (2006). The influence of improper sets of information on judgment: How irrelevant information can bias judged probability. Journal of Experimental Psychology: General, 135, 262.
Downing, P. E. (2000). Interactions between visual working memory and selective attention. Psychological Science, 11, 467–473.
DuCharme, W. M. (1970). Response bias explanation of conservative human inference. Journal of Experimental Psychology, 85, 66.
Dunlosky, J., & Bjork, R. A. (2008). The integrated nature of metamemory and memory. In J. Dunlosky & R. Bjork (Eds.), A handbook of metamemory and memory (pp. 11–28). New York, NY: Psychology Press.
Erev, I., Wallsten, T. S., & Budescu, D. V. (1994). Simultaneous over- and underconfidence: The role of error in judgment processes. Psychological Review, 101, 519.
Fischhoff, B., Slovic, P., & Lichtenstein, S. (1977). Knowing with certainty: The appropriateness of extreme confidence. Journal of Experimental Psychology: Human Perception and Performance, 3, 552.
Folk, C. L., Remington, R. W., & Johnston, J. C. (1992). Involuntary covert orienting is contingent on attentional control settings. Journal of Experimental Psychology: Human Perception and Performance, 18, 1030.
Gigerenzer, G., Hoffrage, U., & Kleinbölting, H. (1991). Probabilistic mental models: A Brunswikian theory of confidence. Psychological Review, 98, 506.
Griffin, D., & Tversky, A. (1992). The weighing of evidence and the determinants of confidence. Cognitive Psychology, 24, 411–435.
Goldstein, D. G., & Gigerenzer, G. (2002). Models of ecological rationality: The recognition heuristic. Psychological Review, 109, 75.
Google Scholar. Retrieved October 30, 2013, from http://scholar.google.com
Hansson, P., Juslin, P., & Winman, A. (2003). Naïve sampling and format dependence in subjective probability calibration. In Proceedings of the 25th Annual Meeting of the Cognitive Science Society.


Hansson, P., Juslin, P., & Winman, A. (2008). The role of short-term memory capacity and task experience for overconfidence in judgment under uncertainty. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34, 1027.
Harbison, J., Dougherty, M. R., Davelaar, E. J., & Fayyad, B. (2009). On the lawfulness of the decision to terminate memory search. Cognition, 111, 397–402.
Hertwig, R., Barron, G., Weber, E. U., & Erev, I. (2004). Decisions from experience and the effect of rare events in risky choice. Psychological Science, 15, 534–539.
Hertzog, C., Fulton, E. K., Sinclair, S. M., & Dunlosky, J. (2014). Recalled aspects of original encoding strategies influence episodic feelings of knowing. Memory & Cognition, 42, 126–140.
Hilbert, M. (2012). Toward a synthesis of cognitive biases: How noisy information processing can bias human decision making. Psychological Bulletin, 138, 211.
Hintzman, D. L. (1984). MINERVA 2: A simulation model of human memory. Behavior Research Methods, Instruments, and Computers, 16, 96–101.
Intelligence Advanced Research Projects Activity (IARPA). (2014). Aggregative Contingent Estimation (ACE). Retrieved from http://www.iarpa.gov/index.php/research-programs/ace
Johnson, E. J., Häubl, G., & Keinan, A. (2007). Aspects of endowment: A query theory of value construction. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33, 461.
Juslin, P. (1994). The overconfidence phenomenon as a consequence of informal experimenter-guided selection of almanac items. Organizational Behavior and Human Decision Processes, 57, 226–246.
Juslin, P., & Persson, M. (2002). PROBabilities from EXemplars (PROBEX): A "lazy" algorithm for probabilistic inference from generic knowledge. Cognitive Science, 26, 563–607.
Juslin, P., Wennerholm, P., & Olsson, H. (1999). Format dependence in subjective probability calibration. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 1038.
Juslin, P., Winman, A., & Hansson, P. (2007). The naïve intuitive statistician: A naïve sampling model of intuitive confidence intervals. Psychological Review, 114, 678.
Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47, 263–291.
Keren, G. (1987). Facing uncertainty in the game of bridge: A calibration study. Organizational Behavior and Human Decision Processes, 39, 98–114.
Lichtenstein, S., & Fischhoff, B. (1977). Do those who know more also know more about how much they know? Organizational Behavior and Human Performance, 20, 159–183.
Lichtenstein, S., & Fischhoff, B. (1980). Training for calibration. Organizational Behavior and Human Performance, 26, 149–171.
Lichtenstein, S., Fischhoff, B., & Phillips, L. D. (1977). Calibration of probabilities: The state of the art. New York, NY: Springer.
Matvey, G., Dunlosky, J., & Guttentag, R. (2001). Fluency of retrieval at study affects judgments of learning (JOLs): An analytic or nonanalytic basis for JOLs? Memory & Cognition, 29, 222–233.
Moore, D. A., & Healy, P. J. (2008). The trouble with overconfidence. Psychological Review, 115, 502.


Nelson, T. O., & Dunlosky, J. (1991). When people's judgments of learning (JOLs) are extremely accurate at predicting subsequent recall: The "delayed-JOL effect." Psychological Science, 2(4), 267–270.
Nelson, T. O., & Dunlosky, J. (1992). How shall we explain the delayed-judgment-of-learning effect? Psychological Science, 3(5), 317–318.
Phillips, L. D., & Edwards, W. (1966). Conservatism in a simple probability inference task. Journal of Experimental Psychology, 72, 346.
Pleskac, T. J., Dougherty, M. R., Rivadeneira, A. W., & Wallsten, T. S. (2009). Random error in judgment: The contribution of encoding and retrieval processes. Journal of Memory and Language, 60, 165–179.
Rosen, V. M., & Engle, R. W. (1997). The role of working memory capacity in retrieval. Journal of Experimental Psychology: General, 126, 211.
Soto, D., Heinke, D., Humphreys, G. W., & Blanco, M. J. (2005). Early, involuntary top-down guidance of attention from working memory. Journal of Experimental Psychology: Human Perception and Performance, 31, 248.
Soto, D., & Humphreys, G. W. (2007). Automatic guidance of visual attention from verbal working memory. Journal of Experimental Psychology: Human Perception and Performance, 33, 730.
Sprenger, A., & Dougherty, M. R. (2006). Differences between probability and frequency judgments: The role of individual differences in working memory capacity. Organizational Behavior and Human Decision Processes, 99, 202–211.

Sprenger, A., & Dougherty, M. R. (2012). Generating and evaluating options for decision making: The impact of sequentially presented evidence. Journal of Experimental Psychology: Learning, Memory, and Cognition, 38, 550.
Sprenger, A. M., Dougherty, M. R., Atkins, S. M., Franco-Watkins, A. M., Thomas, R. P., Lange, N., … Atkins, S. (2011). Implications of cognitive load for hypothesis generation and probability judgment. Frontiers in Psychology, 2, 129.
Stewart, N., Chater, N., & Brown, G. D. (2006). Decision by sampling. Cognitive Psychology, 53, 1–26.
Thomas, R. P., Dougherty, M. R., Sprenger, A. M., & Harbison, J. (2008). Diagnostic hypothesis generation and human judgment. Psychological Review, 115, 155.
Tversky, A., & Koehler, D. J. (1994). Support theory: A nonextensional representation of subjective probability. Psychological Review, 101, 547.
Weber, E. U., Johnson, E. J., Milch, K. F., Chang, H., Brodscholl, J. C., & Goldstein, D. G. (2007). Asymmetric discounting in intertemporal choice: A query-theory account. Psychological Science, 18, 516–523.
Windschitl, P. D., & Wells, G. L. (1998). The alternative-outcomes effect. Journal of Personality and Social Psychology, 75, 1411.
Woodman, G. F., & Luck, S. J. (2007). Do the contents of visual working memory automatically influence attentional selection during visual search? Journal of Experimental Psychology: Human Perception and Performance, 33, 363.


CHAPTER 8

The Self-Consistency Theory of Subjective Confidence

Asher Koriat and Shiri Adiv

Abstract
Innumerable studies have yielded a positive correlation between subjective confidence and accuracy, suggesting that people are skillful in discriminating between correct and wrong answers. The chapter reviews evidence from different domains indicating that people's subjective confidence in an answer is diagnostic of the consensuality of the answer rather than of its accuracy. A self-consistency model (SCM) was proposed to explain why the confidence-accuracy correlation is positive when the correct answer is the consensually chosen answer but is negative when the wrong answer is the consensual answer. Several results that were obtained across a variety of tasks provided support for the generality of the theoretical framework underlying SCM.
Key Words: subjective confidence, confidence-accuracy relationship, self-consistency model, consensuality principle, wisdom of crowds, overconfidence

When people are asked to answer a question or to solve a problem, they can indicate their confidence that the answer or solution is correct. Confidence judgments have been used and investigated in a wide range of domains. These domains include perception and psychophysics, memory and metacognition, decision-making and choice, eyewitness testimony, scholastic achievement and intelligence, social cognition, neuroscience, and animal cognition. Of course, philosophers have also been concerned with the issue of how we can be sure about the truth of assertions (e.g., BonJour, 1985; Engel, 1998). Statisticians also examined these questions from a normative perspective, focusing on the degree of confidence in conclusions that are based on empirical observations (Fisher, 1925; Lykken, 1968). In experimental settings, the collection of confidence judgments was used for different goals. In perception and psychophysics, confidence judgments have been used to explore different quantitative theories of the processes underlying

psychophysical judgments (Vickers, Smith, Burt, & Brown, 1985; Wixted & Mickes, 2010). Forensic psychologists have focused primarily on questions regarding the validity of confidence as a diagnostic cue of the accuracy of a testimony (Bothwell, Deffenbacher, & Brigham, 1987; Read, Lindsay, & Nicholls, 1998; Sporer, Penrod, Read, & Cutler, 1995). Among social psychologists and memory researchers, confidence judgments have attracted attention specifically because these judgments have been found to moderate the likelihood of translating one's beliefs into behavior (Ross, 1997; Tormala & Rucker, 2007; Yzerbyt, Lories, & Dardenne, 1998). Vickers (2001), however, complained that "the variable of confidence seems to have played a Cinderella role in cognitive psychology—relied on for its usefulness, but overlooked as an interesting variable in its own right" (p. 148). Fortunately, there has been increased interest in the study of subjective confidence in its own right, including the processes underlying confidence judgments and the determinants of their accuracy and inaccuracy.

Core Questions

Three issues are central to metacognition research on subjective confidence. The first concerns the correspondence between confidence and performance: How faithful are confidence judgments in mirroring object-level performance? Second, what are the processes underlying the subjective feeling of certainty and doubt? Finally, given the postulated bases of confidence, how do these bases explain the accuracy and inaccuracy of confidence judgments under different conditions?

The accuracy of confidence judgments.

The first question, which has received a great deal of research attention, concerns the accuracy of confidence judgments (e.g., Dunning, Heath, & Suls, 2004; Liberman & Tversky, 1993; Lichtenstein, Fischhoff, & Phillips, 1982). Researchers in the area of judgment and decision-making (see Lichtenstein et al., 1982; Murphy, 1973) have provided a methodology, based on calibration curves, for deriving different scores that convey information about two aspects of metacognitive accuracy: calibration and resolution (see Dunlosky, Mueller, & Thiede, this volume; Higham, Zawadzka, & Hanczakowski, this volume). Calibration ("bias") or "absolute accuracy" (see Nelson & Dunlosky, 1991) refers roughly to the correspondence between mean metacognitive judgments and mean actual performance, and reflects the extent to which confidence judgments are realistic or disclose an overconfidence bias (inflated confidence relative to performance) or an underconfidence bias. Calibration can be evaluated only when judgments and performance are measured on equivalent scales. Such is not the case for the second aspect of metacognitive accuracy, resolution (or relative accuracy). Resolution refers to the extent to which metacognitive judgments are correlated with memory performance across items. This aspect, which is commonly indexed by a within-subject gamma correlation between judgments and performance (Nelson, 1984), reflects the ability to discriminate between correct and incorrect answers. There has been a division of labor in studies of confidence accuracy such that the work within the judgment and decision tradition has focused on the calibration of subjective probabilities (e.g., Griffin & Brenner, 2004). In contrast, the work in metacognition by cognitive psychologists has focused primarily on resolution—the discrimination between correct and wrong answers or judgments (see Koriat, 2007; Metcalfe & Dunlosky, 2008). The observation that people can tell when they are right and when they are wrong has been among the steering forces for the upsurge of interest in metacognition (Hart, 1965; Koriat, 1993; Nelson & Dunlosky, 1991; Tulving & Madigan, 1970). Surprisingly, this observation has received relatively little attention among students of judgment and decision-making, despite the fact that virtually all calibration curves reported in the experimental literature are monotonically increasing, suggesting good resolution (see Keren, 1991). In fact, in studies of recognition memory, it has been noted that low-confidence decisions are associated with close-to-chance accuracy, whereas high-confidence decisions tend to be associated with close-to-perfect accuracy (Mickes, Hwe, Wais, & Wixted, 2011). Similarly, studies that collected confidence judgments in a variety of tasks have generally yielded moderate to high within-person confidence/accuracy (C/A) correlations across items (e.g., Brewer, Keast, & Rishworth, 2002; Lindsay, Wells, & Rumpel, 1981). However, the extensive research on "assessed probabilities" has focused on patterns of miscalibration (e.g., Griffin & Brenner, 2004), taking for granted the accuracy of monitoring resolution. Furthermore, within the judgment and decision tradition there seems to have been an implicit assumption that assessed probabilities ought to be perfectly calibrated, and hence the challenge is to explain deviations from perfect calibration. In metacognition research, in contrast, one of the research goals has been to uncover the bases of confidence judgments and to explain why these judgments are largely accurate.
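As a concrete anchor for these two accuracy scores, the sketch below computes both from one participant's data (Python, with made-up illustrative numbers; the gamma computation follows the standard concordant/discordant definition rather than any particular package).

```python
from itertools import combinations

def calibration_bias(confidence, accuracy):
    """Calibration ('absolute accuracy'): mean judgment minus mean performance.
    Positive values indicate overconfidence, negative values underconfidence."""
    return sum(confidence) / len(confidence) - sum(accuracy) / len(accuracy)

def gamma(confidence, accuracy):
    """Resolution ('relative accuracy'): Goodman-Kruskal gamma across items,
    based on concordant vs. discordant pairs; tied pairs are ignored."""
    concordant = discordant = 0
    for (c1, a1), (c2, a2) in combinations(zip(confidence, accuracy), 2):
        product = (c1 - c2) * (a1 - a2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

conf = [0.9, 0.8, 0.7, 0.6, 0.9, 0.5]   # one participant's confidence judgments (0-1)
acc  = [1,   1,   0,   0,   1,   1]     # whether each answer was correct
print(f"calibration bias: {calibration_bias(conf, acc):+.2f}")  # slight overconfidence here
print(f"resolution (gamma): {gamma(conf, acc):+.2f}")
```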

The bases of confidence judgments.

The second question about confidence judgments concerns their bases. Three theoretical approaches to this question have been distinguished: the direct-access approach, the information-based approach, and the experience-based approach (Koriat, 1997; Koriat & Levy-Sadot, 1999). Research on the possible bases of confidence judgments seems to have been hampered by the implicit endorsement of the direct-access view, according to which these judgments directly mirror memory strength. For example, in strength theories of memory, the assumption is that confidence judgments are scaled from the strength or quality of the internal memory representation (see Van Zandt, 2000, for a review). The direct-access view has resulted in taking the accuracy of confidence judgments for granted.


The direct-access approach is perhaps best represented in the philosophy of knowledge by the claims of rationalist philosophers that a priori truths (e.g., mathematical propositions) are based on intuition and deduction, and that their certainty is self-evident (see Koriat, 2012b). This approach seems to find some expression in the experimental literature on confidence in the idea that some answers and their associated strong confidence are based on "direct retrieval" (Juslin, Winman, & Olsson, 2003; Metcalfe, 2000; Unkelbach & Stahl, 2009). Whether the response to such questions (e.g., "What is your name?"; see Koriat, 2012b) should be assumed to involve a non-inferential basis is still an open question. In contrast to the direct-access view, most researchers in metacognition lean towards the assumption that metacognitive judgments are inferential in nature, relying on a variety of beliefs and heuristics that may be applied under different conditions (see Benjamin & Bjork, 1996; Koriat, 1997; Koriat, Ma'ayan, & Nussinson, 2006). A distinction is drawn, however, between information-based and experience-based judgments (Kelley & Jacoby, 1996; Koriat, Nussinson, Bless, & Shaked, 2008). Information-based judgments rely on an analytic inference in which various considerations are consulted to reach an educated judgment. This view is in the spirit of the reason-based approach of Shafir, Simonson, and Tversky (1993). They argued that when faced with the need to choose, people often seek and construct reasons in order to resolve the conflict and justify their choice. For example, confidence in two-alternative forced-choice (2AFC) general-knowledge questions was claimed to rest on the reasons recruited in favor of the two answers (e.g., Griffin & Tversky, 1992; Koriat, Lichtenstein, & Fischhoff, 1980; McKenzie, 1997). Experience-based judgments, in contrast, are based on mnemonic cues that derive on-line from task performance rather than on the content of domain-specific declarative information retrieved from long-term memory. For example, confidence judgments are said to rest on the ease with which information comes to mind or on the speed with which an answer is selected among distractors (e.g., Kelley & Lindsay, 1993; Koriat et al., 2006; Robinson, Johnson, & Herndon, 1997). The heuristics that shape subjective confidence are assumed to operate largely below full consciousness (Koriat, 2000).

The reasons for the accuracy and inaccuracy of confidence judgments.

The third question concerns the processes underlying the accuracy and inaccuracy of confidence

judgments. As noted earlier, the direct-access view of confidence judgments takes the accuracy of these judgments for granted, to the extent that confidence judgments are assumed to convey information about object-level processes. Inferential approaches to confidence, in contrast, are faced with the challenge of explaining the accuracy of confidence judgments. There has been some debate in the literature regarding the validity of some of the findings documenting systematic discrepancies between confidence and performance. On the one hand, proponents of the ecological probability approach (Dhami, Hertwig, & Hoffrage, 2004; Gigerenzer, Hoffrage, & Kleinbölting, 1991) argued that some of these discrepancies are artifactual, deriving from the failure of researchers to follow the dictum of a representative research design (Brunswik, 1956). Thus, they argued that the overconfidence bias (Hoffrage, 2004) and the hard-easy effect (Griffin & Tversky, 1992) that had been observed in studies of confidence stem from researchers' failure to sample items so that they are representative of the natural environment. Indeed, several studies that used a set of items randomly selected from a circumscribed domain of knowledge found little evidence for overconfidence bias or for the hard-easy effect (Gigerenzer et al., 1991; Juslin, 1993, 1994). On the other hand, among researchers in metacognition, the cue-utilization view has led to a deliberate focus on the inaccuracies of metacognitive judgments in general, and confidence judgments in particular. A large number of studies documented systematic discrepancies between subjective and objective indexes of knowledge. Koriat, Pansky, and Goldsmith (2011) argued that the difference between the two lines of research, one emphasizing a representative design and the other focusing on metacognitive illusions, reflects a difference in research agendas. The first agenda is to obtain a faithful description of the state of affairs in the real world. This agenda requires that the experimental conditions be representative of conditions and variations in the real world. The second agenda is to achieve a theoretical understanding of the phenomena and their underlying mechanisms. This agenda, in contrast, sometimes calls precisely for the use of conditions that are ecologically unrepresentative, even contrived, in order to untangle variables that go hand in hand in real life (see Koriat, 2012a). Indeed, in metacognition research, researchers have sometimes deliberately focused on factors that lead metacognitive judgments astray (e.g., Benjamin, Bjork, & Schwartz, 1998; Brewer & Sampaio, 2006; Busey, Tunnicliff, Loftus, & Loftus, 2000; Chandler, 1994; Koriat, 1995; Rhodes & Castel, 2008).

The Motivation for the Present Proposal

The motivation for the self-consistency model of subjective confidence derived initially from the results of an old study (Koriat, 1975) that examined the C/A relationship in a phonetic symbolism task. In previous studies (e.g., Slobin, 1968), participants were asked to match antonymic pairs from noncognate languages (e.g., tuun-luk) with their English equivalents (deep-shallow). The results indicated that people's matches are significantly better than chance. Koriat (1975) examined whether participants can also monitor the correctness of their matches, and asked participants to indicate their confidence in each match. Participants' object-level accuracy was significantly better than chance: Participants' matches were correct in 58% of the cases. In addition, their meta-level accuracy was also significant: The percentage of correct matches increased steeply with confidence judgments, suggesting that participants were successful in monitoring the correctness of their matches. The latter result presented a puzzle. Neither the information-based approach nor the experience-based approach offers a hint regarding the cues that participants might use to monitor the correctness of their matches. The finding is reminiscent of the direct-access view that rationalists posit with regard to a priori propositions that are accessed through intuition. In an attempt to explain the high C/A correlation, Koriat (1976, see Study 1 in Table 8.1) suggested that the observation that participants' matches are largely accurate ("knowledge") might create a confound in the assessment of the C/A correlation ("metaknowledge"). That is, the correct match is the one that is consensually endorsed, so confidence judgments might actually be correlated with the consensuality of the match rather than with its correctness. Indeed, the results of a subsequent study (Koriat, 1976) confirmed that possibility. In that study, a deliberate effort was made to include a large proportion of items for which participants would be likely to agree on the wrong match. The items were classified post hoc into three classes according to whether the majority of participants agreed on the correct match (consensually-correct; CC), agreed on the wrong match (consensually-wrong; CW), or did not agree on either match (nonconsensual; NC). The results clearly indicated that confidence judgments correlated with the consensuality of the match rather than with its correctness: For the CC class, correct matches were endorsed with stronger confidence than were wrong matches, whereas for the CW class, wrong matches were actually associated with stronger confidence than were correct matches. For the NC class, confidence was unrelated to the correctness of the match. This interactive pattern was referred to as the consensuality principle (Koriat, 2008), and it was found to hold for several domains, as will be reviewed. The results suggest that the positive C/A correlation that has been observed in a great number of studies actually arises because in practically all of these studies participants were more often correct than wrong (i.e., the great majority of items are CC items). Consider, for example, studies of confidence judgments in 2AFC general-information questions. Participants' proportion of correct answers is typically well above .50, and rarely does any of the questions yield more wrong answers than correct answers. The latter questions were sometimes referred to as "deceptive," "misleading," or "unrepresentative" (Fischhoff, Slovic, & Lichtenstein, 1977; Gigerenzer et al., 1991). Similarly, in psychophysical experiments, judgments tend to be largely accurate, with the exception of occasional errors that are not correlated across participants (see Juslin & Olsson, 1997). As a result, the C/A correlation for such questions is typically assessed only across half of the range of proportion correct (.51–1.00), and the range between 0 and .50 is hardly represented. Before we describe the model, we should say a few words about the methodology of the studies on which it was based. In each of these studies, participants answered a series of 2AFC questions. For each question, they chose one answer and indicated their confidence. As noted earlier, SCM was initially motivated by attempts to clarify the accuracy of confidence judgments. However, the results led to the question of the basis of these judgments. Because this question applies also to domains in which the response does not have a truth-value, SCM was extended to the investigation of the process underlying confidence judgments in such domains as social attitudes and social beliefs, personal preferences, and category membership decisions.
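The post-hoc classification and the resulting consensuality pattern can be expressed compactly. The sketch below (Python; the data are fabricated toy numbers chosen to display the pattern, not data from Koriat, 1976) classifies items as CC, CW, or NC from the response distribution and then compares confidence in correct versus wrong answers within each class.

```python
import numpy as np

def classify_items(answers, truth):
    """Post-hoc item classification: CC if the majority of participants chose the correct
    answer, CW if the majority chose the wrong answer, NC if there is no majority.
    answers: (n_participants, n_items) chosen alternatives (0/1); truth: (n_items,)."""
    p_correct = (answers == truth).mean(axis=0)
    return np.select([p_correct > .5, p_correct < .5], ["CC", "CW"], default="NC")

def confidence_by_correctness(answers, confidence, truth, classes):
    """Within each item class, mean confidence for (correct, wrong) answers."""
    correct = answers == truth
    return {cls: (confidence[:, classes == cls][correct[:, classes == cls]].mean(),
                  confidence[:, classes == cls][~correct[:, classes == cls]].mean())
            for cls in np.unique(classes)}

# Toy data: 4 participants x 3 items; item 0 is CC, items 1 and 2 are CW by construction.
answers    = np.array([[1, 0, 1], [1, 0, 0], [1, 0, 1], [0, 0, 1]])
confidence = np.array([[90, 85, 80], [70, 80, 55], [85, 90, 75], [60, 95, 70]])
truth      = np.array([1, 1, 0])
classes = classify_items(answers, truth)
print(confidence_by_correctness(answers, confidence, truth, classes))
# Confidence tracks consensuality: correct > wrong for CC items, wrong > correct for CW items.
```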


The Self-Consistency Model of Subjective Confidence

SCM adopts the metaphor of an intuitive statistician underlying human decision and choice (Peterson & Beach, 1967; see McKenzie, 2005). It assumes that the process underlying choice and confidence is analogous to one in which information is sampled from the outside world with the intention (a) to test a hypothesis about a population and (b) to assess the likelihood that the conclusion reached is correct. It was proposed that when presented with a 2AFC item, it is by replicating the choice process several times that one can appreciate the degree of doubt or certainty involved. Subjective confidence is based on the consistency with which different replications agree in favoring one of the two choices. It represents essentially an assessment of reproducibility—the likelihood that a new replication of the decision process will yield the same choice. Thus, reliability is used by participants as a cue for validity. This is very much like statistical inference, in which conclusions about a population are based on a sample of observations drawn from that population (Koriat, 2012a). Thus, SCM incorporates a sampling assumption that is common in many decision models (e.g., Juslin & Olsson, 1997; Stewart, 2009; Stewart, Chater, & Brown, 2006; Vickers & Pietsch, 2001; Vul, Goodman, Griffiths, & Tenenbaum, 2009). When presented with a 2AFC item, participants are assumed to sample a number of representations from a population of potential representations associated with the item. "Representation" is used here as an abstract term that may apply to different 2AFC tasks. It may include a specific consideration (Koriat, Lichtenstein, & Fischhoff, 1980), a particular interpretation or framing of a choice problem (Tversky & Kahneman, 1981), a "cue" that is used to infer the answer (Gigerenzer et al., 1991), or any hunch or association that may tip the balance in favor of one choice rather than the other. Because of the limitations of the cognitive system and the need to integrate information across representations, the number of representations sampled on each occasion must be quite limited. Participants are assumed to draw the implications of each representation, and to reach an ultimate decision based on the balance of evidence in favor of the two options (Vickers, 2001; see Baranski & Petrusic, 1998). Once a choice has been made, confidence is based primarily on self-consistency—the general agreement among the sampled representations in favoring the decision reached. In SCM, self-consistency is conceptualized as a contentless cue that reflects the mere number of pro and con considerations associated with the choice, irrespective of their meaning and importance (see Alba & Marmorstein, 1987). Clearly, the type of representations retrieved in making a choice should differ depending on the domain of the question. However, SCM assumes that the gross architecture of the process is similar across a variety of 2AFC tasks. An important assumption of SCM is that in responding to 2AFC items, whether they involve general-information questions or beliefs and attitudes, participants with the same experience draw representations largely from the same, commonly shared population of representations associated with each item. Thus, although the specific samples drawn on each occasion may differ for different individuals and for each individual on different occasions, people draw their clues from a pool of clues that is largely commonly shared. In the case of general-information and perceptual judgments, proponents of the ecological approach to cognition (Dhami et al., 2004; Gigerenzer, 2008; Juslin, 1994) have stressed the general accuracy of the shared knowledge, which is assumed to result from adaptation to the natural environment. In addition, the wisdom-of-crowds phenomenon suggests that information that is aggregated across participants is generally closer to the truth than the information provided by each individual participant (Galton, 1907; Mozer, Pashler, & Homaei, 2008; Wallsten, Budescu, Erev, & Diederich, 1997). Thus, we assume that the ingredients that participants use to construct their decisions are drawn from a collective "wisdom." This is the reason for the observation that confidence judgments are diagnostic of the consensuality of the choice.

Implementation of SCM for the Basis of Confidence Judgments

In what follows, we present a specific instantiation of the model that is clearly oversimplified but is sufficient for bringing to the fore the main predictions of SCM. In this instantiation we assume the following: (1) For each 2AFC item, a maximum number of representations (nmax) is sampled randomly. (2) Each representation yields a binary subdecision, favoring one of the two options. (3) When a sequence of a preset number (nrun) of representations yields the same subdecision, the sampling is stopped, and that subdecision dictates the choice (see Audley, 1960). (4) Each subdecision makes an equal contribution to the ultimate, overt decision and to a self-consistency index, which is assumed to underlie subjective confidence. To examine the implications of the model, a simulation experiment was run (see Koriat, 2012a; Koriat & Adiv, 2011) in which nmax was set at 7. Also, nrun was set at 3, so that the actual size of the sample (nact) underlying choice and confidence could vary between 3 and 7. Assume that each item is characterized by a probability distribution, with pmaj denoting the probability that a representation favoring the majority choice will be sampled. This probability can be seen as a property of a binary choice item. The simulation assumed nine binomial populations that differ in pmaj, with pmaj varying from .55 to .95 in .05 steps. For each population, 90,000 iterations were run, in each of which a sample of 3–7 representations was drawn. The ultimate choice was classified as "majority" when it corresponded to the majority value in the population (the one that is consistent with pmaj), and as "minority" when it corresponded to the minority value in the population. A self-consistency index, which is inversely related to the sample standard deviation, was calculated for each iteration. It was defined as 1 − √(p̂q̂) (range .5–1.0), where p̂ and q̂ designate the proportions of sampled representations favoring the two choices, respectively. Based on the results of the simulation, Figure 8.1 presents the self-consistency index, which is assumed to underlie subjective confidence, for

majority and minority choices and for all choices combined as a function of pmaj. Self-consistency increases monotonically with pmaj, but more important, self-consistency is higher for majority than for minority choices. This is because as long as pmaj >.50, majority choices will be supported by a larger proportion of the sampled representations than minority choices. For example, for pmaj  =  .70, and sample size = 7, the likelihood that six or seven representations will favor the majority choice is .329, whereas only in .004 of the samples will six or seven representations favor the minority choice. Thus, the expectation is that confidence should be higher for majority choices than for minority choices. Of course, pmaj for a particular item is not known. However, it can be estimated from pcmaj—the probability with which the majority alternative is chosen. The theoretical function relating pcmaj to pmaj can be obtained from the simulation just described. pcmaj is an accelerated function of pmaj (see Figure 8.1; Koriat, 2012a). This probability can be indexed operationally for each item by (a)  the proportion of participants who choose the preferred alternative (“item consensus”) or by (b) the proportion of times that the same participant chooses his or her most frequent alternative across several presentations of the item (“item consistency”). Turning next to nact, the number of representations actually drawn, the simulation experiment mentioned earlier indicated that the results for nact mimic very closely those obtained for self-consistency. Assuming


Figure 8.1  Self-consistency scores as a function of the probability of drawing a majority representation (Pmaj) based on the results of the simulation experiment. Reproduced with permission from Koriat and Adiv (2011). Copyright 2011 by Guilford Press.


that response speed is an inverse function of nact, then response speed should be faster for majority than for minority choices and should vary as a function of pmaj and pcmaj in much the same way as should confidence judgments (see Koriat, 2012a). In sum, the basic predictions of SCM are as follows:  Confidence and response speed should increase with item consensus—the agreement between participants in making the consensual choice for each item. The same is true for item consistency—the within-person agreement in making the more frequent choice. Item consensus and item consistency are assumed to reflect the polarity of the population of representations associated with each item, and this polarity is assumed to constrain the variability that can be observed in binary decisions for each item. However, when variability in the response choice is observed, confidence and response speed should differ depending on which alternative is chosen: When the decision reached is the decision that accords with that of most other participants, confidence and response speed should be higher than when the decision is a nonconsensual decision. Similarly, in a repeated presentation design, confidence and response speed should be higher for the more frequent response than for the less frequent response. It should be stressed that these predictions are based on the assumption that the same process underlies consensual/frequent decisions and nonconsensual/rare decisions: In each case, each participant chooses the response that is favored by the majority of representations in the sample of representations that he/she has retrieved.
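The simulation described above is straightforward to reproduce in outline. The sketch below is our own minimal Python rendering of the verbal description, not the authors' code; in particular, the way the stopping run dictates the choice is our reading of the description. It displays the two key regularities: pcmaj is an accelerated function of pmaj, and self-consistency, the assumed basis of confidence, is higher for majority than for minority choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MAX, N_RUN, ITERS = 7, 3, 90_000   # parameter values taken from the simulation in the text

def one_iteration(p_maj):
    """Sample binary subdecisions (True favors the majority alternative) until a run of
    N_RUN identical subdecisions occurs, or until N_MAX representations have been drawn."""
    draws, run = [], 0
    while len(draws) < N_MAX:
        d = bool(rng.random() < p_maj)
        run = run + 1 if draws and d == draws[-1] else 1
        draws.append(d)
        if run == N_RUN:
            break
    # The completed run dictates the choice; otherwise the balance of evidence decides.
    chose_majority = draws[-1] if run == N_RUN else sum(draws) > len(draws) / 2
    p_hat = sum(draws) / len(draws)
    self_consistency = 1 - np.sqrt(p_hat * (1 - p_hat))   # ranges from .5 (max conflict) to 1.0
    return chose_majority, self_consistency

for p_maj in (0.55, 0.70, 0.95):
    results = [one_iteration(p_maj) for _ in range(ITERS)]
    pc_maj = np.mean([c for c, _ in results])
    sc_maj = np.mean([s for c, s in results if c])
    sc_min = np.mean([s for c, s in results if not c])
    print(f"p_maj={p_maj:.2f}: pc_maj={pc_maj:.3f}, "
          f"self-consistency majority={sc_maj:.3f} vs. minority={sc_min:.3f}")
```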

Empirical Evidence

In what follows, we present a brief review of the results of several studies that provided a test of the predictions derived from SCM. The aim of some of these studies was to examine the bases of people’s subjective confidence and the reasons for their accuracy and inaccuracy. Other studies additionally attempted to use confidence judgments as a tool that could provide insight into the process underlying people’s construction of their attitudes, beliefs, preferences, predictions, and category membership decisions (Koriat, 2013; Koriat & Adiv, 2011, 2012; Koriat & Sorka, 2015). We first describe the general methodology used in these studies.

Overview of the Methodology and Analytic Procedure

The procedure in the studies to be reviewed was similar except for the domains of the items used.

Participants were presented with a series of 2AFC questions. For each question, they chose one answer and indicated their confidence in their choice either on a full-range scale (0–100) or on a half-range scale (50–100). Response latency was also measured, representing the time it took participants to reach a decision. In all of the studies reviewed in this chapter, participants performed the tasks individually, and had no direct access to the responses of other participants. The same analytic procedure was applied to the results of all studies (see also Bassili, 2003; Huge & Glynn, 2013). First, the two alternative answers to each item were defined post hoc as majority and minority responses on the basis of the distribution of the responses across all participants (items with ties were eliminated). Confidence and response latency were then averaged separately for the majority and minority responses. All studies provided data regarding the effects of between-individual consensus. In these studies, item consensus was defined as the proportion of participants making the majority choice. Item consensus was seen as an index of pcmaj. In some studies, the task was repeated several times, between 5 and 7, usually across several sessions that took place on separate days. In these studies, the analyses from the first presentation provided a test of the predictions concerning between-individual consensus, whereas the analyses across different presentations provided a test of the predictions concerning within-individual consistency. In the latter analyses, the number of times that each of the two responses was made to each item was determined for each participant. The two responses were then classified as frequent or rare according to their relative frequency across presentations. Item consistency was defined as the proportion of times that the frequent choice was made by the person across the repeated presentations of the item, and was used as an alternative index of pcmaj. For some of the tasks used, such as those measuring attitudes and beliefs, the answers do not have a truth-value. These tasks allowed us to test predictions about the basis of confidence judgments, but not about their accuracy. Other tasks, for which the answers have a truth-value, provided, in addition, a test of predictions regarding the accuracy of confidence judgments. These tasks included word matching, general information, perceptual comparison, and the prediction of others' responses. Table 8.1 lists the studies to be reviewed and the tasks used in these studies. For each study, it indicates whether the answers have a truth-value, and


Table 8.1 The studies reviewed in this chapter. For each study, the table presents an example of an item, indicates whether the items have a truth-value, and lists the number of items, participants, and presentations used in that study.

Study | Example of Item | Truth-Value? | Number of Items | Number of Participants | Number of Presentations
1. Word Matching (Koriat, 1976) | Beautiful–Ugly / Chou–Mei | Yes | 85 | 100 | 1
2. General Knowledge (Koriat, 2008) | What actress played Dorothy in the original version of the movie The Wizard of Oz? (a) Judy Garland, (b) Greta Garbo | Yes | 105 | 41 | 1
3. Perceptual—Lines (Koriat, 2011) | Which of the two lines is longer? | Yes | 40 | 39 | 5
4. Perceptual—Shapes (Koriat, 2011) | Which of the two geometric shapes has a larger area? | Yes | 40 | 41 | 5
5. Predictions of Others' Preferences (Koriat, 2013) | Which sport activity would be preferred by most others? (a) jogging, (b) swimming | Yes | 60 | 41 | 1
6. Natural Category Membership (Koriat & Sorka, 2015) | Do olives belong to the fruit category? Yes / No | No | 100 | 33 | 7
7. Beliefs (Koriat & Adiv, 2012) | There is a supreme being controlling the universe. True / False | No | 60 | 41 | 6
8. Attitudes (Koriat & Adiv, 2011) | Capital punishment | No | 50 | 41 | 7
9. Personal Preferences (Koriat, 2013) | Which sport activity would you prefer? (a) jogging, (b) swimming | No | 60 | 41 | 5

hence whether the answers could be scored as correct or wrong. The table also indicates the number of items and participants, the confidence scale used, and the number of presentations.
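A compact sketch of this item-based analytic procedure (Python; a schematic rendering with made-up toy data, not the authors' analysis scripts): majority and minority responses are defined post hoc, item consensus serves as an operational index of pcmaj, and confidence is averaged separately for consensual and nonconsensual responses.

```python
import numpy as np

def item_based_analysis(answers, confidence):
    """answers: (n_participants, n_items) 0/1 choices; confidence: same shape.
    Items with ties between the two responses should be removed beforehand."""
    majority = (answers.mean(axis=0) > .5).astype(int)      # the consensual response per item
    chose_majority = answers == majority
    item_consensus = chose_majority.mean(axis=0)            # operational index of pc_maj
    conf_consensual = np.nanmean(np.where(chose_majority, confidence, np.nan), axis=0)
    conf_nonconsensual = np.nanmean(np.where(~chose_majority, confidence, np.nan), axis=0)
    return item_consensus, conf_consensual, conf_nonconsensual  # nonconsensual is NaN at consensus 1.0

# Toy data: 5 participants x 2 items, confidence on a half-range (50-100) scale
answers    = np.array([[1, 0], [1, 1], [1, 0], [0, 0], [1, 0]])
confidence = np.array([[90, 70], [80, 60], [85, 75], [65, 90], [95, 80]])
consensus, conf_maj, conf_min = item_based_analysis(answers, confidence)
print("item consensus:      ", consensus)   # [0.8 0.8]
print("confidence, majority:", conf_maj)    # [87.5  78.75]
print("confidence, minority:", conf_min)    # [65. 60.]
```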

Let us now review the basic findings. We begin with the results for between-person consensus, and then turn to those of within-individual consistency. These results are pertinent to the idea that subjective


confidence is based on self-consistency. We then examine the question of the accuracy of subjective confidence. We review the findings regarding the predictions of SCM with regard to metacognitive resolution and metacognitive calibration. SCM will be shown to provide a principled account for observations pertaining to both aspects of the C/A correspondence. We end by examining some general implications of the SCM-based results regarding confidence judgments.

The Relationship of Confidence and Response Latency to Cross-Person Consensus

As noted earlier, pcmaj can be indexed by the proportion of participants who choose the majority, consensual answer for each item. To test the predictions of SCM, the following item-based analysis was used. For each item, the answer that was chosen by the majority of participants was designated as the consensual answer, and the other as the nonconsensual answer. Mean confidence was then plotted as a function of item consensus. This was done separately for consensual and nonconsensual answers. We will illustrate the findings by the results obtained in the study of general-information questions (Koriat, 2008, see Study 2 in Table 8.1) and then indicate how these findings were replicated for other tasks. In that study, 105 2AFC general-knowledge questions were used. All answers were one or two words long, either a concept or a name of a person or a place. This format was important for the measurement of choice latency (see later). In addition, the questions were chosen deliberately to yield a large number of CW items, for which the wrong answer was likely to be the consensual, majority answer. Confidence was measured. Figure 8.2A presents mean confidence judgments for each of six item-consensus categories for both consensual and nonconsensual answers (for one item all participants chose the majority answer). Several trends are suggested by the results: 1. Mean overall confidence judgments (“All” in Figure 8.2A) increased monotonically with increasing item consensus. When mean confidence and mean item consensus were calculated for each item, the correlation between them over all 105 items was .505, p < .0001. 2. However, consensual answers were endorsed with higher confidence (M = 70.9%) than nonconsensual answers (M = 64.6%), t(103) = 6.74, p < .0001, and this was true regardless of the accuracy of these answers.

This difference was consistent across items: For 78 items, confidence was higher for the consensual answer than for the nonconsensual answer, compared with 26 items in which the pattern was reversed, p < .0001, by a binomial test. 3. It should be noted that in this study, as in all other studies, there were marked and reliable individual differences in the tendency to make relatively high or relatively low confidence judgments (see Kleitman & Stankov, 2001; Stankov & Crawford, 1997). Because the confidence means for consensual and nonconsensual answers in Figure 8.2A were based on different participants for each item, the differences between these means may reflect a between-individual effect: Participants who tend to choose consensual answers tend to be more confident. To control for inter-participant differences in confidence, the confidence judgments of each participant were standardized so that the mean and standard deviation of each participant were set as those of the raw scores across all participants. Average scores were then calculated for each item for consensual and nonconsensual answers. The consensual-nonconsensual differences were practically the same for the standardized confidence scores. 4. The same general difference between consensual and nonconsensual answers was obtained in subject-based analyses. In these analyses, confidence was compared for each participant between consensual and nonconsensual answers. The results indicated that participants were more confident in their response when that response agreed with the consensual, majority response (72.31%) than when it departed from it (64.36%), t(40) = 14.79, p < .0001. All 41 participants exhibited this pattern, p < .0001, by a binomial test. 5. The moderating effect of item consensus for confidence: We expected the difference in confidence between consensual and nonconsensual responses to increase with item consensus (see Figure 8.1B). This increase can be seen in Figure 8.2A, but its statistical significance could not be tested on the results presented in that figure because each of the means for the consensual and nonconsensual functions was based on a different combination of participants. However, we calculated for each participant the functions depicted in Figure 8.2A relating mean confidence in consensual and nonconsensual responses to grouped item



Figure 8.2 Panel A: Mean confidence in the correctness of answers to general-information questions for majority and minority answers and for all responses combined as a function of item consensus (the proportion of participants who chose the majority answer). Panel B presents mean choice latency as a function of item consensus for majority answers, minority answers, and for all answers combined. Indicated in the figure is also the number of items (n) in each item-consensus category. The results are based on a reanalysis of the data of Koriat (2008). Reproduced with permission from Koriat (2011). Copyright © 2012 by the American Psychological Association.

consensus categories. The rank order correlation between the ordinal value of the item consensus category (1 to 6) and the difference in mean confidence between consensual and nonconsensual responses (using for each participant the

observations for which this difference was computable) averaged .55 across participants, p < .0001. This correlation was positive for 35 of the 40 participants (one had a tie), p < .0001, by a binomial test.


6. We turn next to the results for response latency. It should be noted that response speed was generally correlated with confidence, consistent with previous findings (e.g., Koriat et al., 2006; Robinson et al., 1997). Similar analyses to those of confidence were conducted for response latency. The pattern depicted in Figure 8.2B was largely obtained for response speed. Response speed increased monotonically with item consensus:  The correlation between mean latency and item consensus was –.42 across the 105 items, p