Ten Lectures on the Representation of Events in Language, Perception, Memory, and Action Control

Distinguished Lectures in Cognitive Linguistics

Edited by Fuyin (Thomas) Li (Beihang University, Beijing)

Guest Editors: Mengmin Xu, Hongxia Jia and Na Liu (Beihang University)

Editorial Assistants: Jing Du, Na Liu and Cuiying Zhang (doctoral students at Beihang University)

Editorial Board: Jürgen Bohnemeyer (State University of New York at Buffalo) – Alan Cienki (Vrije Universiteit (VU) in Amsterdam, The Netherlands and Moscow State Linguistic University, Russia) – William Croft (University of New Mexico, USA) – Ewa Dąbrowska (Northumbria University, UK) – Gilles Fauconnier (University of California at San Diego, USA) – Dirk Geeraerts (University of Leuven, Belgium) – Nikolas Gisborne (The University of Edinburgh, UK) – Cliff Goddard (Griffith University, Australia) – Stefan Th. Gries (University of California in Santa Barbara, USA) – Laura A. Janda (University of Tromsø, Norway) – Zoltán Kövecses (Eötvös Loránd University, Hungary) – George Lakoff (University of California at Berkeley, USA) – Ronald W. Langacker (University of California at San Diego, USA) – Chris Sinha (Hunan University, China) – Leonard Talmy (State University of New York at Buffalo, USA) – John R. Taylor (University of Otago, New Zealand) – Mark Turner (Case Western Reserve University, USA) – Sherman Wilcox (University of New Mexico, USA) – Phillip Wolff (Emory University, USA) – Jeffrey M. Zacks (Washington University in Saint Louis, USA)

Distinguished Lectures in Cognitive Linguistics publishes the keynote lectures series given by prominent international scholars at the China International Forum on Cognitive Linguistics since 2004. Each volume contains the transcripts of 10 lectures under one theme given by an acknowledged expert on a subject, and readers have access to the audio recordings of the lectures through links in the e-book and QR codes in the printed volume. This series provides a unique course on the broad subject of Cognitive Linguistics. Speakers include George Lakoff, Ronald Langacker, Leonard Talmy, Laura Janda, Dirk Geeraerts, Ewa Dąbrowska and many others.

The titles published in this series are listed at brill.com/dlcl

Ten Lectures on the Representation of Events in Language, Perception, Memory, and Action Control By

Jeffrey M. Zacks

LEIDEN | BOSTON

Library of Congress Cataloging-in-Publication Data Names: Zacks, Jeffrey M., author. Title: Ten lectures on the representation of events in language,  perception, memory, and action control / Jeffrey M. Zacks. Description: Leiden ; Boston : Brill, 2020. | Series: Distinguished  lectures in cognitive linguistics, 2468-4872 ; vol. 22 | Includes  bibliographical references. Identifiers: LCCN 2019037956 (print) | LCCN 2019037957 (ebook) |  ISBN 9789004394995 (hardback) | ISBN 9789004395169 (ebook) Subjects: LCSH: Cognition. | Perception. | Memory. Classification: LCC BF311 .Z33 2020 (print) | LCC BF311 (ebook) |  DDC 153.2—dc23 LC record available at https://lccn.loc.gov/2019037956 LC ebook record available at https://lccn.loc.gov/2019037957

Typeface for the Latin, Greek, and Cyrillic scripts: “Brill”. See and download: brill.com/brill-typeface. ISSN 2468-4872 ISBN 978-90-04-39499-5 (hardback) ISBN 978-90-04-39516-9 (e-book) Copyright 2020 by Jeffrey M. Zacks. Reproduced with kind permission from the author by Koninklijke Brill NV, Leiden, The Netherlands. Koninklijke Brill NV incorporates the imprints Brill, Brill Hes & De Graaf, Brill Nijhoff, Brill Rodopi, Brill Sense, Hotei Publishing, mentis Verlag, Verlag Ferdinand Schöningh and Wilhelm Fink Verlag. All rights reserved. No part of this publication may be reproduced, translated, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission from the publisher. Authorization to photocopy items for internal or personal use is granted by Koninklijke Brill NV provided that the appropriate fees are paid directly to The Copyright Clearance Center, 222 Rosewood Drive, Suite 910, Danvers, MA 01923, USA. Fees are subject to change. This book is printed on acid-free paper and produced in a sustainable manner.

Contents

Note on Supplementary Material
Preface by the Series Editor
Preface by the Author
About the Author

1 The Importance of Events in Conception and Language
2 The Structure and Format of Event Representations
3 Event Segmentation Theory and the Segmentation of Visual Events
4 The Segmentation of Narrative Events
5 Neural Correlates of Event Segmentation
6 Prediction in Event Comprehension
7 Updating Event Models
8 The Event Horizon Model and Long-Term Memory
9 Event Cognition in Aging and Early Alzheimer's Disease
10 Event Representations from Cinema and Narrative Fiction

Bibliography
About the Series Editor
Websites for Cognitive Linguistics and CIFCL Speakers

Note on Supplementary Material

All original audio-recordings and other supplementary material, such as handouts and PowerPoint presentations for the lecture series, have been made available online and are referenced via unique DOI numbers on the website www.figshare.com. They may be accessed via a QR code for the print version of this book. In the e-book both the QR code and dynamic links will be available which can be accessed by a mouse-click. The material can be accessed on figshare.com through a PC internet browser or via mobile devices such as a smartphone or tablet. To listen to the audio recording on hand-held devices, the QR code that appears at the beginning of each chapter should be scanned with a smart phone or tablet. A QR reader/scanner and audio player should be installed on these devices. Alternatively, for the e-book version, one can simply click on the QR code provided to be redirected to the appropriate website. This book has been made with the intent that the book and the audio are both available and usable as separate entities. Both are complemented by the availability of the actual files of the presentations and material provided as hand-outs at the time these lectures were given. All rights and permission remain with the authors of the respective works; the audio-recording and supplementary material are made available in Open Access via a CC-BY-NC license and are reproduced with kind permission from the authors. The recordings are courtesy of the China International Forum on Cognitive Linguistics (http://cifcl.buaa.edu.cn/), funded by the Beihang University Grant for International Outstanding Scholars.

The complete collection of lectures by Jeffrey M. Zacks can be accessed via this QR code and the following dynamic link: https://doi.org/10.6084/m9.figshare.c.4585811.

© Jeffrey M. Zacks, Reproduced with kind permission from the author by koninklijke brill nv, leiden, 2020 | doi:10.1163/9789004395169_001

Preface by the Series Editor

The present text, entitled Ten Lectures on the Representation of Events in Language, Perception, Memory, and Action Control by Jeffrey M. Zacks, is a transcribed version of the lectures given by Professor Zacks in December 2017 as the forum speaker for the 17th China International Forum on Cognitive Linguistics.

The China International Forum on Cognitive Linguistics (http://cifcl.buaa.edu.cn/) provides a forum for eminent international scholars to give lectures on their original contributions to the field of cognitive linguistics. It is a continuing program organized by several prestigious universities in Beijing. The following is a list of organizers for CIFCL 17.

Organizer: Fuyin (Thomas) Li: PhD/Professor, Beihang University
Co-organizers: Yihong Gao: PhD/Professor, Peking University
Baohui Shi: PhD/Professor, Beijing Forestry University
Yuan Gao: PhD/Professor, University of Chinese Academy of Sciences
Cong Tian: PhD, Capital Normal University

The text is published, accompanied by its audio disc counterpart, as one of the Distinguished Lectures in Cognitive Linguistics. The transcriptions of the video, proofreading of the text and publication of the work in its present book form have involved many people's strenuous efforts. The initial transcripts were completed by Lin Yu, Jinmei Li, Jing Du, Mengmin Xu, Yangrui Zhang, Ning Guo, Shu Qi, Shan Zuo and Guannan Zhao. Na Liu and Hongxia Jia made revisions to the whole text. We editors then made word-by-word and line-by-line revisions. To improve the readability of the text, we have deleted the false starts, repetitions, and fillers like now, so, you know, OK, and so on, again, of course, if you like, sort of, etc. Occasionally, the written version needs an additional word to be clear, a word that was not actually spoken in the lecture. We have added such words within single brackets […]. To make the written version readable, even without watching the film, we've added a few "stage directions", in italics also within single brackets: […]. These describe what the speaker was doing, such as pointing at a slide, showing an object, etc.

Professor Zacks made final revisions to the transcriptions; the published version is the final version approved by the speaker.

Thomas Fuyin Li

Beihang University [email protected]

Mengmin Xu

Beihang University [email protected]

Preface by the Author

I am deeply grateful to Professor Fuyin (Thomas) Li for inviting me to speak in the keynote role at the China International Forum on Cognitive Linguistics in December, 2017. Professor Li presented me with an enticing opportunity, but also with two scientific challenges. First, although my research program deeply engages language and my theorizing owes much to foundational work in cognitive semantics, my empirical work mostly falls into psychology and neuroscience. I hope that the explorations undertaken in my laboratory over the years may be useful to cognitive linguists; however, I was intimidated at the prospect of making the case for its relevance and significance for this audience. Second, the task of casting two decades' worth of accreted data and theory into a coherent form was daunting. When it came time to give the lectures, the supportive and engaged group at the Forum was a real treat. My sincere hope is that these pages and the accompanying video lectures will be helpful to at least some linguists, computer scientists, education researchers, psychologists and neuroscientists in thinking about the role of events in human experience.

I would like to thank Professor Li for his invitation and for his tireless and selfless efforts organizing the Forum. I would also like to extend my warm regards and thanks to the organizing team: Jing (Milly) Du, Ning (Barry) Guo, Zhiyong Hu, Hongxia (Melody) Jia, Jinmei (Catherine) Li, Na (Selina) Liu, Siqing (Margaret) Ma, Shu (Viola) Qi, Yu (Carl) Shen, Lin (Joyce) Yu, Mengmin (Amy) Xu, Yangrui (Toni) Zhang, Guannan (Vivian) Zhao, Xiaoran (Kara) Zhou, Shan (Amanda) Zuo. Thomas, Amy, and Milly were my primary contacts and were thoughtful and gracious hosts. However, the whole team was incredibly warm, considerate, and generous with the time they shared with my family and me. Apropos of my family, I would last like to thank Leslie, Jonah, and Delia for joining me on this adventure, and on all the others.

Finally, I would like to gratefully acknowledge the financial support of the US National Institutes of Health, the US National Science Foundation, the James S. McDonnell Foundation, the US Defense Advanced Research Projects Agency, and the US Office of Naval Research. In particular, during the time of this lecture my laboratory was supported by NIH grant R21AG05231401 and ONR grant N00014-17-1-2961.

Jeffrey M. Zacks

Washington University in Saint Louis June 2018

About the Author

Jeffrey M. Zacks received his bachelor's degree in Cognitive Science from Yale University and his PhD in Cognitive Psychology from Stanford University in 1999. He is Professor and Associate Chair of Psychological & Brain Sciences, and Professor of Radiology, at Washington University in Saint Louis, where he directs the Dynamic Cognition Laboratory (dcl.wustl.edu). Zacks' research interests lie in perception, cognition, and language in complex, naturalistic domains, including event cognition, spatial reasoning, film perception, and narrative comprehension. His research takes a lifespan development approach and compares the cognitive functioning of healthy adults to that of people with neurological and psychiatric disorders. His laboratory uses behavioral experiments, functional MRI, computational modeling, and eye tracking in these investigations. Research in the Dynamic Cognition Laboratory has been funded by the Defense Advanced Research Projects Agency, the Office of Naval Research, the National Science Foundation, the National Institutes of Health, and the James S. McDonnell Foundation, whose support is gratefully acknowledged.

Professor Zacks has served as Associate Editor of the journals Cognition, Cognitive Research: Principles & Implications, and Collabra, and as Chair of the governing board of the Psychonomic Society, the leading association of experimental psychologists, and as Chair of the Board of Scientific Affairs of the American Psychological Association. He is the recipient of scientific awards from the American Psychological Association and the American Psychological Foundation and is a fellow of the American Association for the Advancement of Science, the Association for Psychological Science, the Midwest Psychological Association, and the Society of Experimental Psychologists. Zacks is the author of two books, Flicker: Your Brain on Movies and Event Cognition (with G. A. Radvansky), and co-editor of Understanding Events (with Thomas F. Shipley) and Representations in Mind and World (with Holly A. Taylor). He has published more than 90 journal articles and also has written for Salon, Aeon, and The New York Times.

Lecture 1

The Importance of Events in Conception and Language

All original audio-recordings and other supplementary material, such as any hand-outs and PowerPoint presentations for the lecture series, have been made available online and are referenced via unique DOI numbers on the website www.figshare.com. They may be accessed via this QR code and the following dynamic link: https://doi.org/10.6084/m9.figshare.8982398.

© Jeffrey M. Zacks, Reproduced with kind permission from the author by koninklijke brill nv, leiden, 2020 | doi:10.1163/9789004395169_002

Thank you Amy, thank you Thomas. It's an honor to be here. I've been thrilled by the reception and I'm very honored to have this time to share with you some of my current thinking about how events are represented in the mind, in our cognitive artifacts, and in our language. I'd like to start by acknowledging the organizing team, who have been truly wonderful in hosting us. And I'd also be remiss if I didn't acknowledge the agencies that funded the research I'm going to describe. These include the Defense Advanced Research Projects Agency, the James S. McDonnell Foundation, the National Institutes of Health, the National Science Foundation and the Office of Naval Research. I also would be highly remiss if I didn't acknowledge my collaborators on this research. I've included the references on the slides, and I won't stop to name the collaborators on each project. But I should emphasize that this work involves teams of people working on related shared interests, and a lot of it would not be possible without the people that I work with. These include students and post-doctoral fellows and collaborators at Washington University and other institutions. And I also need to thank you so much for taking this time with me, as I know it's a big commitment to spend a week on a series of lectures as well as your own presentations. Thank you for taking this time.

Let me say a few words about the overall plan for these five days. Today's lecture is called "The Importance of Events", and what I hope to do today is to describe the scope of the problem that we've been interested in in my laboratory and try to make a case that this is a generally interesting, important problem and to give a sense of its shape.
I realize that I am coming from a more psychological, more neuroscientific perspective than many of you, who may be more interested in language. And so I want to particularly try to point out how some of the nonlinguistic features of event cognition may be of interest to those whose primary interests are in language. This afternoon's lecture will be about the structure and format of event representations—diving into some of the general principles of how events are represented in the mind that are going to run through the lectures. If you are not able to make all the lectures, I would encourage you to try to attend this one—the third lecture—because I'll lay out a general theoretical framework for how to think about the structure of events and time that I will use in some of the later lectures. In the fourth lecture, I'll apply that framework specifically to the segmentation of narrative events. In the fifth lecture, we'll talk in some more detail about how events are represented in the brain and particularly how the temporal structure is represented. In the sixth lecture, I'll talk about the role of prediction, about anticipating the future, and its role in ongoing cognition and in language comprehension. In the seventh lecture, I'll talk about what we usually refer to as working memory, the maintenance of information about events online, and how it's updated. In the eighth lecture, I'll present a model of retrieval from long-term memory, and describe how it plays an important role in the comprehension of narrative text. In the ninth lecture, I'll apply these concepts to studying how perception, memory, and language change as we get older and if we come down with Alzheimer's disease. And in the last lecture, hopefully we'll have a little bit of fun. I'll talk about how event representations play an important role in the construction and perception of movies and narrative fiction. That's the plan for the week.

Let me just start with an example of a piece of narrative text. This is from J. R. R. Tolkien's The Hobbit. And first of all I will just read it out. It says,

Gandalf sat at the head of the party with the thirteen dwarves all round; and Bilbo sat on a stool at the fireside, nibbling at a biscuit (his appetite was quite taken away), and trying to look as if this was all perfectly ordinary and not in the least an adventure. The dwarves ate and ate, and talked and talked, and time got on. At last they pushed their chairs back, and Bilbo made a move to collect the plates and glasses.

These are four sentences. One thing that we could do when we look at these four sentences is count the number of event descriptions in there. So, "sat at the head of the party", most of us would agree it is an event. This second one, you might not agree as an event. You might say it's a situation but maybe not an event. So "the thirteen dwarves all round", that's iffy if it is one. But "sat on the stool" is certainly an event, "nibbling at a biscuit"—that's an event. "Appetite was quite taken away", "trying to look" as if this was all perfectly ordinary, "adventure", "ate and ate", "talked and talked", "time got on", "at last they pushed their chairs back", "made a move", "collect the plates and glasses"—so in the span of four sentences, we've got either fifteen or sixteen event descriptions. And what that tells me is that a lot of what we talk about, a lot of what our language is built to represent, is events.

Another way that we can come at the problem is by looking at the building blocks of our language. So, here's a list of the hundred most frequent verbs in English, and one thing that is interesting is when we look at this list, things get complicated. So, the very first one is "be". Is "being" an event? In European philosophy, this is a question that vexed philosophers for hundreds of years, so we could argue about it. I'm not going to have an answer for that one in these lectures, but many of these are events, but interestingly they're not necessarily the sorts of physical actions that I illustrated in The Hobbit. So, "do" is a physical event, "say", "go", "get". But many of them are mental, like "know", "think", "look", "want". So, there's a mix of event categories that are represented by the common verbs. And this tells us that the relationship between events in perception and our use of language is going to be complicated, so there's something to look at here.

Speaking of philosophers, William James was a brilliant one. James was one of the great American philosophers of the twentieth century, and also was the founder of the first psychological laboratory in my country. If you want to find a description of a problem or a solution in psychology, it's always good to look at James—he usually got there before the rest of us. He made a really important point about how our minds represent events that's illustrated by this passage. It's a little bit long, but let me read it out.

We see that the mind is at every stage a theatre of simultaneous possibilities. Consciousness consists in the comparison of these with each other, the selection of some, and the suppression of the rest by the reinforcing and inhibiting agency of attention. The highest and most elaborated mental products are filtered from the data chosen by the faculty next beneath, out of the mass offered by the faculty below that, which mass in turn was sifted from a still larger amount of yet simpler material, and so on. The mind, in short, works on the data it receives very much as a sculptor works on his block of stone. In a sense, the statue stood there from eternity. But there were a thousand different ones beside it, and the sculptor alone is to thank for having extricated this one from the rest. Just so the world of each of us, howsoever different our several views of it may be, all lay embedded in the primordial chaos of sensations, which gave the mere matter to the thought of all of us indifferently. We may, if we like, by our reasonings unwind things back to that black and jointless continuity of space and moving clouds of swarming atoms which science calls the only real world. But all the while the world we feel and live in will be that which our ancestors and we, by slowly cumulative strokes of choice, have extricated out of this, like sculptors, by simply rejecting certain portions of the given stuff. Other sculptors, other statues from the same stone! Other minds, other worlds from the same monotonous and inexpressive chaos!

James (1890) describes how, through a succession of abstractions and refinements and recomputations, we construct a world. It is not just given to us. The events that we experience and that we represent in our language are not just out there in the world. They are a product of the interaction between the world as it's given to us and our mental faculties. This is a view of course that goes back to the philosopher Immanuel Kant.

I want to adduce from these initial examples three points. First, events are the stuff of our lives; it's worth taking the time to investigate how we represent them in our minds, because they are much of what we experience and think about. [[Second,]] they are a lot of what we talk about. And [[third,]] they may not be simply given—rather, they emerge from the interaction between our worlds and our minds.

OK, so with that motivation on the table, I'd like to turn to some philosophical concerns that underpin work on the psychology and linguistics of events. Some of the kinds of questions that philosophers have been concerned with include: How do we figure out what the events are? Is there a well-defined class of events that we can individuate and call out and pick out from the others? And if so, what are the criteria by which we individuate one event from the mass of experience? Are events bounded in space-time? Can you say where one event ends and where another one begins in space and in time? And if so, is it necessary that for something to be an event that it be contiguous in space-time? Can I have an event that occurs in multiple places at the same time?

So, when we celebrate the turning over of the calendar in a few days, is New Year's Eve around the world (it will take place fourteen hours later in the United States, where I come from) one event even though it takes place in multiple places? Is this series of lectures one event even though it's interrupted? Are events logical particulars like objects? Are they things that we can predicate attributes of? Are they things of which we can say, "this event has a particular property"?

One answer to some of these questions is to say, "Yes, events are logical particulars." This was Davidson's favored point of view. So, for example, if we say something in natural language like "Phillip discussed causality", Davidson would argue that what we mean by that is that there was some event, call it x, such that "discussed" with the arguments "Phillip" and "causality" is true of x. So, this formula [["∃x Discussed(Phillip, causality, x)"]] is saying "There exists some x such that 'discussed, Phillip, causality, x'". What this entails is that events have the same ontological status as objects. In the same way that we can form criteria to discriminate—to identify one coffee cup from another coffee cup—we can form criteria to identify one object from another object. There's a really extreme version of this that was posited by Willard Van Orman Quine, who said that maybe events just are the same things as objects, namely bounded regions of space-time. So, for example, if we consider "a birthday cake": a birthday cake has a certain spatial extent and it comes into being at some point in time, and it goes out of being at some other point in time, hopefully in the middle of a birthday party. But, similarly, you could argue the birthday party itself is a bounded region of space-time. It begins maybe when the guests arrive and ends when they leave, and it has a temporal extent and also a spatial extent; it occurs in a particular room. And on Quine's view, the only real difference is a matter of attention, and this is highlighted in language. Event language forces us to focus on the temporal aspect of the extent of the thing, whereas object language focuses us on the spatial dimension.

Not everybody is happy with this way of thinking about events as logical particulars. Philosophers like Jaegwon Kim pointed to examples like this one: Imagine that there is a ball that is increasing in temperature and at the same time spinning. It seems intuitive to say that there are two events there. The "events as particulars" view has to say that there is one event in which the ball is both heating and spinning, but if you stop someone on the street and ask them, they might feel like there is one event of the ball heating and there is another event of the ball spinning; they just happen to take place in the same region of space at the same time. So, Kim says that the Davidsonian account seems wrong, and proposes an alternative account of what an event is: it is the exemplification of some property at some particular time.

That's just giving a formal grounding to the intuition that this is one event in which the property is heating, and this is another event in which the property is spinning, and they happen to occupy the same space and time. But their exemplifications are of different properties.

Now, one issue with this approach is that there seems to be an indefinite—indeed unbounded—number of properties that we could pick out at any given time at any given location, and so the number of events that might be individuated grows unacceptably quickly. A view that is more closely related to the Davidsonian view is the situation semantics theory developed by John Perry and Jon Barwise (1983). Perry and Barwise proposed that situations are individuated not by simple logical predication forms, but by a structured predication. To specify an event, you have to specify a minimal number of properties that include the individuals that are present, the properties of those individuals, and a set of event states. They distinguish between two kinds of event states: there are states of affairs—configurations of the world—and also courses-of-events, which are sequences of states of affairs. So, the representation of an event on this account consists of all of these things, and also temporal locations. In other words, to individuate an event, you have to specify this structured kind of representation. This philosophical view is particularly interesting because it was designed to rationalize our intuitive psychology about events with a philosophically consistent framework, in a way that could be used to actually do computation.
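To see the contrast compactly, the Davidsonian and Kimian analyses described above can be written out schematically. This is only a notational summary of what has just been said, using standard event-semantics conventions; the symbols o, P, and t are my own shorthand and are not taken from the lecture slides.

```latex
% Davidson: "Phillip discussed causality" asserts the existence of an
% event x of which the predicate holds.
\exists x \,\bigl[\, \mathrm{Discussed}(\mathit{Phillip},\, \mathit{causality},\, x) \,\bigr]

% Kim: an event is the exemplification of a property P by an object o at
% a time t, often written as a triple. The heating and the spinning of
% the same ball then count as two events, because P differs even though
% o and t coincide:
[\mathit{ball},\, \mathrm{heating},\, t] \;\neq\; [\mathit{ball},\, \mathrm{spinning},\, t]
```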

Given this collection of ways of thinking about events in philosophy, which ones are going to turn out to be useful for cognitive science, for linguistics, and psychology? As I said, the property exemplification view is really problematic because it posits this large, indeed unbounded, number of events corresponding to an everyday scene. So, if we observe a child going down a slide in a park, there's an event of sliding, there's an event of the slide heating up minimally, there's an event of one of the parents walking from point A to point B, there's another event of another parent walking from point C to point D—there is an indefinite number of properties being exemplified at that point, and discriminating what is relevant to that particular spatiotemporal extent is not solved by that theory. So, thinking about how the brain or the mind could represent events in this sense is really challenging. Quine's events-as-objects view is the most straightforward, but it provides little room for psychology and it doesn't seem to capture our intuitions that events in fact are not the same things as objects. So, this leaves situation semantics, which is more complex but more powerful and is going to turn out to correspond more closely to the psychological accounts of event representation that have been successful, but I think that both of [the first two] have important things to offer.

With these philosophical accounts in mind, Barbara Tversky & Zacks (2001) proposed this definition of events for psychology: "An event is a segment of time at a given location that is conceived by an observer to have a beginning and an end". This is closely related to Quine's proposal, but it's psychologized: rather than referring to the things out in the world, with James we're going to cast it in terms of the constructs of the mind. This is the view that I'm going to be working with through most of the lectures.

I want to acknowledge important alternatives in the history of the psychology of events. The psychologist James J. Gibson had a very different view of events in psychology. Rather than being things that are conceived or constructed by the mind, for Gibson an event was a class of stimuli that is defined by an invariant structure in the environment that persists for some time. So, a classic case for Gibson was "the dropping of a ball". If you drop a ball, then the ball falls and bounces and bounces, and the bounce becomes smaller until it comes to rest. And Gibson would say that the event is the duration of the bouncing, from when the ball falls to when it comes to rest. Though the trajectory of the ball is constantly changing and it is constantly subject to acceleration, it is governed by one dynamics, which is the dynamics of acceleration and collision under gravity, so as long as that invariant structure persists, that's an event. Gibson distinguished a bunch of invariant structures that could produce events in the world—things like changes in layout, changes in color and texture, changes in surface existence. Let me give more detailed examples. The kinds of changes in layout that are relevant to humans are things like rigid translation and rotation of an object (something falling or turning or rolling, collisions of objects, rebounding, joining), nonrigid deformations of an object (for example, as when animals locomote and when things like flags blow in the wind), surface deformations (things like waves and flow), surface disruptions (things like puncturing or explosion); changes in color and texture (including things like greening or turning brown of leaves, ripening, flowering), changes in animal surfaces (camouflage, changes of plumage, changes of fur) and terrestrial surfaces (weathering of rock, blackening of wood, rust); changes in the existence of surfaces (so, things can melt, dissolve, sublimate, precipitate), disintegration, biological decay, destruction, aggregation, biological growth, and then construction, the emergence of new surfaces.

For Gibson, events are things that are given out in the world, contrary to what I attributed to James.

And what makes them relevant for psychology is that they are the kinds of things that biological organisms such as ourselves are attuned to in regulating our behavior. The Gibsonian approach, I want to acknowledge, offers many insights into how people respond quickly to simple features of their environment. The Gibsonian classification of events, and individuation of events in terms of invariant dynamics, gives a great account of how we do things like "dodge rapidly approaching objects" or "navigate around obstacles". However, there are just large swaths of experience for which this description of events gives no account whatsoever. It doesn't say much about the inner life of the observer, it doesn't say much about how we plan our actions for the future, and it doesn't say much about the structure within events. So, it gives events individuated in terms of invariant dynamics, but within the dynamical structure of an event it doesn't have anything to say about how those components relate to each other. And because of that it can't say anything about the parts. And, last but not least, it has nothing to say about how we remember events that happened in the past; it can only tell us about how we respond in real time to simple events that happen in front of us. I believe the Gibsonian conception of events is really important for thinking about rapid interactive motor control, but [[it]] doesn't have a lot to say about cognition and language.

The last philosophical topic that I want to treat is the relationship between actions and events. Actions are obviously a major component of events. When I gave that list of events in The Hobbit, most of them turn out to be actions. And this raises the important question of how we conceptualize actions, and how they relate to events. One important concept proposed by Arthur Danto (1963) in the philosophy of actions is the notion of a basic action. Danto says that "B is a basic action if and only if (i) B is an action and (ii) whenever a performs B, there's no other action A performed by [[a such that B is caused by A]]"—sorry, small "a" is a person and large "A" is the name of an action. So, whenever person "a" performs action B, there is no other action A performed by person a such that B is caused by action A. That's the definition, and intuitively it just means basic actions are the most basic intentional things that you can do—the most logically simple things that you can do on purpose. For example, lifting up your arm, that's a basic action; moving your head, that's a basic action. Actions such as "raising one's arm to ask a question" or "craning to look at an approaching person" are nonbasic actions that have basic actions as their causes. In Danto's view there is a hierarchy of actions such that we could describe a given happening. For whatever reason the philosophers prefer very morbid examples here.

So, the favorite in this literature is describing an action as pulling a trigger, shooting, firing a gun, murdering someone, assassinating someone; that's a hierarchy of actions where pulling the trigger is the lowest-level action and assassination is the highest-level action. So, Danto's notion is of causal dependence such that each higher level depends on the lower level—so craning to see depends on moving your head. A related notion in psychology is taxonomic hierarchy: Craning is a kind of head motion. And orthogonal to the notion of taxonomy is the notion of partonomy. So, we can ask what are the classes to which an action belongs, and we can also ask what are the larger actions of which a given action is a part. I'm going to spend a little bit of time today on both of these.

OK. So, to summarize this philosophical section, I want to say that events raise important questions about what exists (metaphysics, ontology) and also how we can know (epistemology). And both of those I think—but especially epistemological questions—are relevant for psychology and linguistics. With Davidson and Quine, it's often helpful to treat events as things in the sense that cars and kittens are things. We can look at the common everyday events of our world much as we can look at the common everyday objects, and study how language and psychology represent them. As with other sorts of things, the internal structure of them matters a lot, and so one of the gifts of philosophy to cognitive science on this point is an attention to the internal structure of objects and events.

With that in mind I want to turn to how this philosophical background has allowed us to draw lessons from the psychology of objects for thinking about the psychology of events, and the two lessons I want to draw are lessons about taxonomy—categorical relationships—and partonomy—part, sub-part relationships.

Let's start with taxonomy. We'll have a little audience participation part now. I'm going to ask you to name these objects in English for me, and I want you to try to name them as quickly as possible when they pop up, so get ready. When I pulled these images from the internet, here are the filenames that were associated with them: flashlight, butterfly, cactus, fish. Now, notice that nobody said saguaro cactus, and nobody gave the genus of the butterfly (I couldn't do it), or of the fish. I think there's something very interesting there, I'll give you one more. Name this as quickly as possible. Now things get a little bit interesting. Most people said chair, right? And this is what Eleanor Rosch called "the basic level category". Nobody says furniture when asked to name this thing, right? Calling a picture like this furniture would be considered a joke or uncooperative. One person said armchair, but nobody identified it at the level of an Amish bentwood rocker.

(These come from my part of the country; they're made by hand, they steam the wood, it's really cool.) So, we can consider a hierarchy of category labels where furniture is up here, chair is intermediate, and then these subcategories like rocker or desk chair are down here. Rosch called this a superordinate category [[furniture]], this [[chair]] a basic level category, and this [[Amish bentwood rocker]] a subordinate level category. What she showed is that that basic level is really quite special. As we just saw, it's the preferred level in naming: If you just ask people to name things without much of a context, it's the level that most people are going to provide. Basic-level categories are verified more quickly than superordinate or subordinate level categories, so if I give you a label chair and then a picture and ask you to tell me as quickly as possible, does the picture fit the label? You can do that very quickly for a chair, but more slowly for furniture or for Amish bentwood rocker. If you ask people to list what are the features of chairs, they can list lots of features, but if you ask people to list the features of rocking chairs, they can only produce a couple more features. And if you ask people to list the features of furniture, there are very few features that all furniture has in common. There's a big jump in the number of consistent features when you go from superordinate to basic. And then there's only a small further increase when you go down to subordinate. If you average together the silhouettes of lots of chairs, you'll get something that looks kind of like a chair. If you average together the silhouettes of lots of furniture, you'll just get a mess. And this means that they're faster to be named when people are asked to name things from pictures. If you look at the kinds of motor actions that are directed toward an object, you'll find the kinds of things that you do with a chair are very consistent whether it's a rocking chair or a desk chair or a dining chair. But the kinds of movements that you direct towards furniture are not at all consistent. And [[basic-level objects]] have more common parts. Most chairs will have legs, a seat, and a back, but there are rarely common parts that all furniture share. If you look at the development of categories, children learn what chairs are much earlier than they learn what rocking chairs are or what furniture is. In our conceptual system there is a hierarchy with a middle ground that corresponds to the kinds of things of everyday life; chairs and birds and fish are the things that we interact with, whereas these more specific or more general categories are things that we generally tend to think about more deliberatively.

Now, it's important to know that what is a basic level category depends a bit on one's experience.

For most of the people that I spend time with, bird and tiger are basic-level terms, whereas the different types of birds are subordinate level categories. But the people I spend time with are mostly bird novices. If you spend time with a bird watcher, then bird is a superordinate category, and they're very very fluent and rapid at identifying and conceptualizing things at the level of "finch" or "eagle", and then they have a bunch of subordinate categories for the different kinds of finches of which I'm completely ignorant. So, part of what makes something basic level is your experience, and those of us who live in cities tend to be novices when it comes to natural kind categories, categories for animals and plants. People who live in cultures that are closer to nature often have much more elaborate category systems for those natural kinds.

Again in English, I'd like to ask you to name as quickly as possible the animation that I'm about to show you. OK, so get ready, it's going to be very brief and try and tell me as quickly as possible what you see.

[[Audience]]: Move.

[[Zacks]]: Move?! Oh, that's interesting, that's a superordinate. Look here, I'll show it one more time.

[[Audience]]: Kick.

[[Zacks]]: OK, so most people say "kick". Paul Hemeren (1997) showed that, just like people can identify objects more quickly at an intermediate level, we can identify actions more quickly at an intermediate level. So, somebody did say move, which is cool. Nobody said karate kick. Maybe we would have to be in Japan. But again, we've got a superordinate, a basic, and a subordinate level categorization. And just like with objects, basic level events seem to be privileged. So, these are the levels of description, the labels that we tend—both in terms of nouns and verbs—to use in normal conversation about actions. Again, like objects they have more information about common features. If I ask you what are the features of movement, it's hard to list features. If I ask you what are the features of a kick or a throw, it's a lot easier, right? In a kick you move your leg and the leg usually goes out and/or up. If I ask you to list the additional features of a karate kick, it's hard to articulate what they are. That's very much like objects. Basic level event categories are judged to be similar and they can be verified very quickly.

Here's just one example of what we can get very quickly from basic level events. Alon Hafri and his colleagues (2012) ran an experiment in which they had people viewing still pictures of simple actions, and the experimental paradigm was that they saw a fixation cross and watched this: There was two hundred milliseconds of blank screen, and then a target image presented for 37 or 73 milliseconds, followed by a bunch of visual noise to wipe out the image so your brain didn't continue to process it.

didn’t continue to process it. And then people were asked to verify whether they saw a particular action in this thing. And then after 30 milliseconds of processing their visual systems were able to verify punching quite easily. They could also figure out whether it was the male or the female that was doing the punching. And what they found is that there are characteristics features even without movement of the poses of the bodies that allow you to determine which is the agent [[and]] which is the patient in this action. One set of lessons from objects is that they have a taxonomic organization with a privileged basic level that we have rapid and fluent access to. The other lesson that I want to draw from objects is about partonomy—the parts of objects. And here I want to construe objects a little bit more broadly to include things like scenes. Here’s a picture of a scene that we’ve chopped up a little bit. I apologize for the low quality of the images. You can probably tell even with the poor quality and the dismemberment of this image that it’s a picture of an office scene, right? But if we chop it at the boundaries of the objects, it’s much easier to identify what it is. Here’s another example. This is easily identified as the parts of rhinoceros. But if those parts are chopped at their joints, then it’s easier to do so. So, visual objects have natural parts that observers can easily identify and can agree on what those parts are. If we were to show a simple abstract shape like this to a large group of observers, and asked how many parts there are, how many parts are there in this? Three, most people would say three, and they would say that the boundaries of the parts are right here, there’s one here and then there’s a big one. Now, researchers in visual object recognition in computer science have argued that the parts that our visual system is built to abstract are those parts that would occur from the interpenetration of objects. So, if I make a form by sticking together three parts like this, then I will get these characteristic junctures where there are points of maximal local curvature. If you ask how fast is the outline turning, these are the points where the outline is turning the fastest, and reliably, if you jam two things together, you will get points of maximal local curvature from almost any view point at the points where they interpenetrate. You can see this in things like the silhouette of a face. The points of maximum local curvature illustrated here nicely separate the nose from the upper lip from the lower lip from the chin. It turns out that these points of maximal local curvature are very valuable for individuating what the objects are.

This is a classic experiment by Irving Biederman (1987), in which he showed people pictures—line drawings from which he deleted half of the outline. Both of these have fifty percent of the contour deleted. How many of you think that this [[left]] would be the easier one to recognize? How many of you think that this [[right]] would be the easier one to recognize? Absolutely, and that's exactly what they found. This one [[right]] preserves all of the contour discontinuities, this one [[left]] deletes all the contour discontinuities, so this [[right]] gives you a lot more information about what the parts are. Scenes and objects are very rapidly perceived as consisting of parts, and shape features—namely maxima of local curvature—give reliable cues to the functional partonomy of both objects and scenes. And probably for this reason people seem to use them habitually in perception. So, the question that we and others have asked is, "Does the same thing hold for events?"

One important thing to know about the partonomic structure of objects is that it's hierarchical. Marr and Nishihara (1978), in building their model of visual object recognition, noted that most objects consist of parts which themselves consist of parts which themselves consist of parts. So, if you think about a human body, you can think of it roughly as consisting of a head and a torso, and arms, and legs; the arm breaks down into an upper arm and a lower arm, and the lower arm or forearm has the arm and then the hand. Similarly, the kinds of events that we experience break down into parts and sub-parts, so if one eats out at a restaurant, it might consist of a set of parts like entering the restaurant, ordering, eating, and leaving. Ordering itself is going to consist of parts such as bringing the menu and placing the order; placing the order itself is going to consist of parts like cruising the menu and speaking the order, and so on. A natural question to ask about these event parts is, can they be individuated based on the same kinds of discontinuity constraints that govern shape perception?

Mandy Maguire and Tim Shipley with their colleagues (2011) did this really neat experiment where they asked people to look at objects such as these, or to view an animation of a point traversing these contours, moving along this contour. And then they asked them to identify where the parts of the objects or the events were. And we can view their data by superimposing a graph on the contours themselves. What they are doing here is plotting the number of people who said this is a boundary with the length of these whiskers. And what you can see is that for both the trajectories (viewed as an animation of a point) and for the contours people break them into parts at these points of maximal local curvature.

I want to particularly call attention to this extreme point here. It's so extreme that it's a little bit hard to see—the edge of the contour is here, and so this is all the people that identified it as a boundary, and similarly here. But all of these extrema in curvature people say are boundaries, both for objects and for events. They then ran a second experiment in which they put a camera in a tall building and filmed a person walking across a field. Just as for the simple point trajectories, when the person had a maximally curving trajectory, that was viewed as an event boundary. The boundaries of events, like the boundaries of objects, can be ascertained—at least for motion events—from points of maximal local curvature.

I wanted to pause for a moment to point out that it's not always easy to identify the boundaries of events, and I want to say that that's not a problem. If we ask when this lecture began, well, one answer might be when the first person arrived, another one might be when my first slide went up on the screen, or when we all quieted down before Amy and Thomas introduced me. There is room for disagreement about exactly where it began. We could ask when the Internet Age began. You could pick a landmark—for example, at midnight on January 1st, 1983, the ARPANET adopted the TCP/IP Protocol that we still live with today, and you could say that was the beginning of the Internet age. But, you know, ARPANET was proposed in the 60s, it was initially implemented in 1969, so maybe that's the beginning of the Internet age. The scientist and policymaker Vannevar Bush, back in the 1940s, imagined a system in which one could sit at a device, and call up information from all over the world. So maybe that's the beginning of the Internet age. On the other hand, even though we had the ARPANET with TCP/IP in 1983, the World Wide Web didn't launch until 1990. (I remember so vividly when it did. I was working in a telecommunications lab at the time; they asked me to look into this new thing called the World Wide Web, and I wrote a report saying "You know, there's not much there, it's only about twenty pages, and if there were more there you wouldn't be able to find any of it anyway." Look how far we've come since then.) So, when did the Internet age begin? Who knows exactly? How can we identify what are the beginnings and endings of events with any certainty? I want to say that even though this is true, this is totally normal, and we have a similar problem for objects. And in fact, if you don't look too hard, for both events and objects, most of the time it's pretty easy [[to identify boundaries well enough to get by]].

Let me trace out the example for objects.

This is a map of St Louis, Missouri, and you can see the Mississippi and Missouri rivers coming together to form the lower Mississippi. And in a map like this, the boundaries appear completely crisp and determined. If you look at a satellite picture, you'll start to see that it's a little bit harder to see exactly where the edges of the river are. If you zoom in, you can start to see some real challenging cases. So, this is zooming in on the location where the Missouri River and the Mississippi River join, and the Missouri River, you can see with good reason, is called the Big Muddy. It's this browner water joining the less brown water of the Mississippi. But where exactly is the boundary of the Missouri where it runs into the Mississippi? This is where we mark the confluence of the two rivers on maps. But you can see that the water stays pretty well separated for some distance downstream, so is this part of the Mississippi or the Missouri? Well, it's a matter of judgment. And worse yet, this is in 1993 when we had major flooding in St Louis. So, this is before the flood, this is at the peak of the flood. So, where is the boundary of the river? You know, is it here, where the water got to at the high water mark, which is dry most of the time? This turns out to be of practical significance. So, for example, in Missouri, navigable waterways are public property, so anybody can paddle a canoe down the Missouri or the Mississippi. But does that mean that I can stand here in the middle of a farmer's field when it's dry and that's part of my entitlement because it's part of the river at some point? These are actually things that people have lawsuits over. So, the boundaries of objects in the world are often fuzzy and likewise the boundaries of events are fuzzy, and I don't think this is a problem for either. Most of the time we can individuate the objects and the events well enough to get by.

An important question about the parts of events is: "Do they have, in this hierarchical organization, a basic level like [[what]] we saw for objects?" Roger Barker and his colleagues (1954) said "Yes". [[They]] said that there's something like a basic level. They didn't use that term, but they posited a special level analogous to the one at which we habitually think about objects. They call this the behavior episode. They said "behavior episodes are analogous to physical objects which can be seen with the naked eye. They are the common things of behavior; they correspond to the stones, chairs, and buildings of the physical world". And they proposed a specific set of criteria that individuate behavior episodes.

physical direction of the behavior. (4) A change in the object towards which the behavior is directed. (5) A change in the behavior setting, that is, the location. (6) A change in the tempo of the activity. They argued that any of these things can give rise to a new behavior episode in the mind of the observer. One example they give is of a child walking to school. We can think about this on a continuum of timescales. At the shortest level, the child is stepping down from the curb; at the next level up, the child is crossing the street; at the next level up, the child is walking to school; at the next level up, the child is working to pass from the third grade into the fourth grade; at the next level up, the child is getting an education; and at the next level up, the child is climbing to the top in life. As you go up these levels, increasing in timescale, there is a kind of transition from features that are more physically determined to determinations that are more thematic. And I would argue that something like the bottom three levels corresponds to a basic level in the partonomy of an event. We can break things down even more finely. You know, we could say “lifting the leg, moving the leg forward, lowering the leg”, but we don’t talk about events in language at that level of description. More broadly, what seems to come out in our work is that there’s a range of timescales, from about a second or so to tens of minutes, at which people easily label and describe and remember events. If you try to go much faster than that, the system breaks down a little bit, and if you try to go much longer than that, it breaks down a little bit too. So there’s probably a range of behavior episodes from, like I said, about a second or so to tens of minutes. If we think about events that are much faster, such as the decay of a subatomic particle, or much slower, such as the motion of continents or the heat death of the universe, we think about those things by analogy. We run a little animation in our head that renders them into the timescale of behavior episodes. I want to say that although we can draw really important lessons from the analogy between objects and events, there is an important place where this analogy breaks down, and therefore there are some factors in the psychology and language of events that are unique and don’t analogize to objects, and that’s where time is concerned. Objects persist over time, but events are ephemeral and unique. There’s a Greek philosopher, Heraclitus, who said you can never step in the same stream twice, and that’s a point about the nature of our experience of events. I can get up each morning and identify something either in terms of its category—oh, that’s a mug—or I can identify it as a particular mug—that’s Leslie’s mug. But for events, I can categorize an event as an instance of going to a restaurant,
but it would never be the case that it’s the same going-to-the-restaurant that I experienced last week, unless something very strange is going on in my life involving something like time travel. However, I think it’s important—one of the really exciting things about living within the last hundred and twenty years—that it’s now possible to record events with pretty high perceptual fidelity. With the onset of motion picture projection, we acquired the ability to record the visual and auditory properties (to a limited but reasonably complete extent) of an event, and then play it back at will. There’s an approximation to being able to repeat the same experience that was not available in human history until that point, so that’s kind of interesting. In addition, the spatial structure of objects is basically symmetric, whereas time introduces a big asymmetry into the structure of events. Identifying something as being on my left or on my right is a matter of symmetry, and if I turn around then those change, but whether something is in the future versus in the past is a huge asymmetry, right? The way you’re going to treat an event that’s in the future is completely different from the way you’re going to treat an event that’s in the past. Just to recapitulate what I’ve tried to say today, I want to suggest that events are the stuff of our lives, and they are much of what we talk about in language. One lesson from the philosophy of events is that to understand how events are represented, we can look for analogies and disanalogies to objects. And, like objects, events are cognitively represented in taxonomic hierarchies. And, also like objects or scenes, events have internal structure: notably, they have parts and sub-parts, and for both objects and events, there seems to be a privileged level, corresponding to what we could call a basic level or behavior episode, to which we have fluent and rapid access. With these considerations in mind, what I’m going to try and do in the afternoon is to present an account of how events can be represented for the purposes of perception, action control, and language. So, I’ll leave off here. I think we’ll take a few minutes for a break and then have some questions.

Lecture 2

The Structure and Format of Event Representations

This afternoon I’d like to move from the general considerations about events in language and cognition to talking a little bit more specifically, but still at a theoretical level, about the nature of event representation, and then tomorrow we will move into much more empirical work. So, I want you to start by thinking about the first time you went to a restaurant. Maybe it was a fancy restaurant like this one, or maybe it was more like this. But in any case, most of us, by the time we are adults, have had the experience of going many, many times to eat in a restaurant. The question I want to ask is how humans integrate a lifetime of experience with a category of events like that with their current inputs when they are reading a narrative text that might involve a restaurant. And relatedly, what is the same about the experience of an event, whether it is something we’re actually living or something we’re experiencing vicariously through language? And related to that is the question of how events unfolding in front of our eyes are transmuted into memories. So that’s the general set of questions that I want to explore. I want to start by making some definitions. Consider a picture like this. Some of you may recognize those are my children a few years ago, when we lived in England. We can contrast a representation of my children with another kind of representation. We could say “Jonah is a young boy”, “Jonah has brown hair”, “Delia is an even younger girl”, “Delia is wearing a red sweater”, “Delia is standing next to Jonah”, “Delia and Jonah are standing in front of a shrubbery” (one of those uniquely English words that we don’t use in American English, but which proper English uses). On some accounts, there is some presumably large but definite set of sentences, something like these, that can capture the meaning of a picture like this. But there is some complicated relationship between these two kinds of representations [the picture and the sentences].

All original audio-recordings and other supplementary material, such as any hand-outs and powerpoint presentations for the lecture series, have been made available online and are referenced via unique DOI numbers on the website www.figshare.com. They may be accessed via this QR code and the following dynamic link: https://doi.org/10.6084/m9.figshare.8982431.
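Returning to the picture and the sentences: to make the contrast between these two kinds of representations concrete in computational terms, here is a small, purely illustrative sketch in Python. It is not part of the lecture or its materials, and every name in it is invented. The same toy scene is held once in a picture-like format, a two-dimensional grid whose layout mirrors the scene, and once in a sentence-like format, a list of arbitrary symbols that can be recombined.

    # Illustrative sketch only; the scene layout is invented.
    # Picture-like format: a 2-D grid whose rows and columns correspond
    # one-to-one to up-down and left-right in the scene. Tearing out a
    # single cell does not leave a part that still works on its own.
    scene_grid = [
        ["sky",    "sky",    "sky"],
        ["Jonah",  "Delia",  "shrubbery"],
        ["ground", "ground", "ground"],
    ]

    # Sentence-like format: arbitrary symbols combined componentially.
    # Any component can be pulled out and reused in a new assertion.
    propositions = [
        ("young-boy", "Jonah"),
        ("has-brown-hair", "Jonah"),
        ("young-girl", "Delia"),
        ("wearing", "Delia", "red sweater"),
        ("next-to", "Delia", "Jonah"),
        ("in-front-of", "Delia", "shrubbery"),
    ]

    # A spatial relation can be read directly off the grid's layout:
    row = scene_grid[1]
    print(row.index("Jonah") < row.index("Delia"))   # True in this toy layout

    # Recombining existing components yields a new assertion "for free":
    print(("in-front-of", "Jonah", "shrubbery"))

The sketch is only meant to give a concrete handle on the two formats; their properties are spelled out in what follows.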

The first of these is, of course, a depictive representation, whereas the second is a propositional representation. Depictive representations are isomorphic to the things that they represent. That means there is a smooth one-to-one correspondence between some dimension in the representation—in this case an image—and some dimension in the thing being represented. They are also holistic, which means you can’t decompose them into parts and have those parts still work properly. Let me say a little bit more about both of those things. Isomorphism means there is a smooth one-to-one correspondence between the representation and the things being represented. In the case of a picture, something that is to the right in the picture is also to the right in the real world. So that’s a spatial isomorphism. Something that’s brighter in the picture is more reflective of light in the real world. That’s a non-spatial but still perceptual isomorphism. Now, not every isomorphic representation needs to use the dimension that is being represented in the representation itself. So in the case of a picture, we are using space in the picture to represent space in the world. But in a famous example that Roger Shepard and Susan Chipman (1970) gave, you can think of the relationship between a key and a lock as another, very different kind of isomorphism. The key doesn’t look anything like the lock, but they correspond in the right way, such that if you have a small change in the spatial arrangement of the key, you need to have a similar corresponding change in the arrangement of the lock in order for them to still work together functionally. Holistic means that you can’t decompose them into parts and have them still function properly. Let me just make this really concrete. So you know I showed this, but what I’m really talking about is literally something like this. This is the only prop that I brought for the week. And I brought it to illustrate that when we talk about an isomorphic representation, I really mean ink and paper in a correspondence to the world. In terms of holism, even more so. If I have a picture like this, I could maybe carefully tear out the part that corresponds to Jonah’s head, or better yet his arm. But I can’t move it around and have it function like an arm; what’s left behind is just a hole. I can’t manipulate the parts. Whereas a propositional representation, we will see, has parts that can be decomposed and recombined. So a model case of a depictive representation is a picture like this. In contrast, propositional representations are arbitrary, rather than being isomorphic to the things represented. The relationship is arbitrary, so if we consider a sentence like “Jonah is a young boy”, there is nothing young or male about those words, those characters. And of course Saussure told us it was the cardinal feature of representation in language that the relationship between the symbol and the thing that’s represented is an arbitrary one. Now, of course, natural languages may be largely arbitrary, but there are
sound-to-meaning correspondences that are non-accidental. But for a long time, this was taken as a core marker of a natural language, and it’s at least approximately true. Propositional representations, of course, are also componential. So I can take a word like “boy” out of a sentence and use it in another sentence and it functions properly, but if I rip out this piece of the picture I can’t use it in another picture. We can do this at the level of words; we can do this at the level of morphemes or phonemes, or clauses or sentences. So natural languages are componential at many levels. These attributes are what give propositional representations their generativity and expressive power: without a flexible relationship between the representation and the things being represented, and without the ability to recombine parts, you don’t get the generativity of natural languages, as has been famously described by Fodor and others. My model case of a propositional representation is a sentence. A sentence in natural language is pretty good, but even better is a sentence in an artificial language, like a programming language or a propositional calculus. Those are truly propositional representations that are both arbitrary and componential. A couple of characterizations of depictive representations and propositions by others: Shepard and Cooper, in their 1982 book on mental imagery, wrote, “We classify an internal process as [depictive] if and only if the intermediate stages of the internal process have a demonstrable one-to-one relation to intermediate stages of the corresponding external process—if that external process were to take place” (they use the term mental imagery; for my purposes, mental imagery and depictive are equivalent). In contrast, here is Zenon Pylyshyn’s (1973) description of what a propositional representation is. The first and foremost thing is that it does not correspond to a raw sensory pattern. It’s the opposite of isomorphic. It’s not different in principle from the kind of knowledge asserted by a sentence. It depends on the classification of sensory events into a finite set of concepts and relations. In Pylyshyn’s view, one of the criteria for a propositional representation is that it has components, and those components include concepts and relations among concepts. There is abundant evidence for depictive representations in the nervous system. I am guessing this is material that will be familiar to many of you, but I just want to review it to draw out some points about the nature of depictive representations in the brain. This is an image of the brain, the left hemisphere, taken from a medial view, highlighting the primary visual cortex. The primary visual cortex is in the very
back of the brain, folded in on the medial side. It’s the first place that visual information projecting from the retina hits the brain. Studies in nonhuman primates and in humans have provided abundant evidence that the representations in primary visual cortex are depictive—that is to say, isomorphic and holistic. Here is one great example from Roger Tootell’s group (1982). They trained monkeys to simply hold fixation on the stimulus and then presented them with the flickering bullseye pattern illustrated here. The actual stimulus alternated, so that the white and black components alternated on this superimposed neutral grey background. That kind of flickering drives the visual neurons in the cortex maximally, and while the monkey looked at the stimulus, it was perfused with a radioactive glucose tracer, such that the parts of the cells that were more active would take up more of the tracer. After viewing the stimulus, the animals were sacrificed and primary visual cortex was dissected out and then laid out on X-ray film, so that those parts of the tissue that took up more of the tracer would expose the film more. So the monkey fixates a pattern while the tracer is perfused, and then the cortex is dissected out and applied to radiographic film. And when you do that and flatten out the tissue applied to the film, this is the pattern that you get. So this corresponds to this part of the pattern, and this corresponds to this part of the pattern. And then you’d see the corresponding half: this is from the visual cortex of one hemisphere, and the other side of the image is represented in the cortex of the other hemisphere. Clearly, there is a one-to-one spatial correspondence between the elements of the response in primary visual cortex and the elements of the stimulus. Similar studies have been carried out in humans. These are two medial views of the primary visual cortex and surrounding areas. Rather than being actual radiographs, these are data from a functional MRI experiment in which people watched stimuli that either slowly expanded radially, so that this ring slowly gets bigger over time and a new one starts in the middle, or watched flickering wedges that rotated clockwise. What these allow you to do is encode either the radius or the direction of the stimulus relative to the visual field in time. So there is a particular point in the cycle when the center of the visual field is being stimulated and another, later point when the outer part of the visual field is being stimulated, and then that repeats; and similarly, there is a particular point in time when the upper right visual quadrant is being stimulated and another point in time when the lower visual quadrant is being stimulated, and that repeats quickly. We can ask which cells are most activated when the stimulus is at this radial location or at this eccentricity. You can see a nice smooth progression. This is
capturing the distance from the center to the periphery, and this is capturing the orientation. So put these together, and you have the information that is shown here. In addition to visual representations, the representations of touch and motor control have strong depictive properties. The central sulcus, running right down the middle of the brain here, is bordered on the posterior bank by the somatosensory cortex and on the anterior bank by the motor cortex, and in both of these primary cortical representations there is a smooth progression, with the legs represented on the medial bank of the cortex and then moving down to the hands, the face, and the tongue, way down on the most lateral and inferior surfaces. These maps were originally made by the neurosurgeon Wilder Penfield from cortical stimulation studies during brain surgery, but they have been replicated using multiple modalities, including functional MRI and magnetoencephalography. Last but not least, in hearing, we have a representation that’s depictive of frequency. This originates in the cochlea and is represented in primary auditory cortex. This is an interesting representation, because rather than using space to represent space, it’s using space to represent frequency. The basilar membrane of the cochlea is arranged within this resonance chamber such that the apical part of the cochlea resonates to low frequencies and the basal, outermost part resonates to high frequencies. And this gives rise to a smooth frequency representation in the cochlea. The hair cells along the cochlea project systematically to the primary auditory cortex, depicted here. So in the banks of Heschl’s gyrus, there is a smooth tonotopic representation such that the apex of the cochlea projects to the anterior and outer part and the base projects to the posterior part, giving rise to a representation that uses space in the brain to encode pitch out in the world. So we have depictive representations of visual information, of somatosensory and motor information, and of auditory information in the brain. Somehow, these are related to our experiences of these dimensions, and eventually to the way that we talk about them. So, a “grand challenge” for cognitive science has been to work out the respective roles of depictive and propositional representations in our cognition. One proposal comes from Jerry Fodor. This is the language of thought proposal. In Fodor’s view, the primary role of a depictive representation is simply to be an interface to transduce experience. The real action in the central cognitive system happens in terms of propositional representations. So what perception does is take continuous dimensions and translate them through a set of depictive representations in order to give us propositions, and all the real work gets done on propositions.
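As an aside for readers who want a computational handle on the phase-encoded mapping experiments described above (the expanding rings and the rotating wedges), here is a minimal sketch of the standard analysis idea: because the stimulus sweeps the visual field periodically, the phase of a voxel’s response at the stimulation frequency indicates which part of the visual field that voxel prefers. This is an illustrative reconstruction under simplifying assumptions, not the analysis pipeline of the studies just cited, and the function name and the numbers are invented.

    import numpy as np

    def preferred_phase(ts, n_cycles):
        # Where in the stimulus cycle (0 to 2*pi) does this voxel's
        # response peak? ts is one voxel's BOLD time series; n_cycles is
        # the number of stimulus cycles in the run. For ring stimuli the
        # phase maps onto eccentricity, for wedge stimuli onto polar
        # angle. A real analysis would also correct for the hemodynamic
        # delay, which is ignored here.
        spectrum = np.fft.rfft(ts - ts.mean())
        return (-np.angle(spectrum[n_cycles])) % (2 * np.pi)

    # Toy check: a voxel whose preferred location is reached a quarter of
    # the way through each cycle should come out near pi/2.
    rng = np.random.default_rng(0)
    n_timepoints, n_cycles = 240, 8
    t = np.arange(n_timepoints)
    voxel = np.cos(2 * np.pi * n_cycles * t / n_timepoints - np.pi / 2)
    voxel = voxel + 0.3 * rng.standard_normal(n_timepoints)   # noise
    print(preferred_phase(voxel, n_cycles))                   # roughly 1.57

Applied voxel by voxel, once for the ring runs and once for the wedge runs, something like this yields the paired eccentricity and polar-angle maps just described.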

This is closely related to the position that Zenon Pylyshyn took in the imagery debate. In that series of debates, primarily between Steve Kosslyn and Zenon Pylyshyn, the argument was about the extent to which people’s cognition about visuospatial information depends on depictive representations, as opposed to those representations merely serving as transduction mechanisms. There is currently a debate going on about the role of depictive representations in language comprehension, where proponents of embodied approaches to language comprehension argue that depictive representations are fundamental for our comprehension of language, that in order to comprehend a sentence such as “Jonah kicked a ball”, I need to activate the appropriate representations in my somatosensory and motor cortex, whereas opponents of this view argue that the real action is happening in terms of semantic representations that are fundamentally arbitrary in their relationship to perception. Other areas where depictive representations have been explored, and where there has been significant disagreement about their range of application, include forward models in robotics and forward models in language production. This is a question of the extent to which, when I am trying to control an artificial or natural effector, or when I am trying to articulate a sentence, I am building an anticipatory representation of what the consequences of particular actions are going to be. And in longer-range action planning, which may be offline from strict motor control, similar issues come up. In these debates, most of the focus has been on isomorphism: to what extent are the representations that underlie cognition isomorphic to the things being represented? But I want to argue that we should not forget about the question of whether the representation is holistic or componential; that’s super important too. Let’s now turn again to our contrast of a picture with a propositional representation, in this case a piece of computer code. We’ve got this dimension where this end is both depictive and holistic and this end is both arbitrary and componential, and we mostly tend to think about it as isomorphic versus arbitrary. But you know, if we remember that there is another dimension of holistic versus componential, this opens up the question of whether there are combinations that are, say, holistic but arbitrary. I am sure there are. How about isomorphic but componential? And it’s this quadrant that I really think is super exciting. In this quadrant land things like this. It’s a children’s toy that was a favorite in my house when I was a kid, and then in my new house when my children were kids. It’s called a big loader construction set. It has all these components that isomorphically represent things in the world. So you’ve got diggers and haulers, and trams and so forth. They have a one-to-one spatial correspondence to the things that they represent in the world.
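The combination singled out here, representations that are isomorphic and componential at the same time, can also be given a small computational sketch. Nothing below comes from the lecture materials; the class and field names are invented for illustration, and the sketch anticipates the point made next, that the components of such a model also interact in a meaningful way. Each part carries coordinates that mirror the layout it stands for (isomorphism), while parts can be detached and recombined and the whole keeps working (componentiality).

    from dataclasses import dataclass, field

    @dataclass
    class Part:
        name: str
        x: float   # position in the model mirrors position in the scene
        y: float

    @dataclass
    class Model:
        parts: list = field(default_factory=list)

        def to_the_left_of(self, a, b):
            # Spatial relations are read straight off the coordinates,
            # because the model's space maps smoothly onto the scene's.
            pos = {p.name: p.x for p in self.parts}
            return pos[a] < pos[b]

    # A toy construction-yard model with spatially faithful parts.
    yard = Model([Part("digger", 0.0, 0.0), Part("hauler", 2.0, 0.0)])
    print(yard.to_the_left_of("digger", "hauler"))   # True

    # Componentiality: pull a part out, drop it into a new configuration,
    # and the same relational machinery still works.
    digger = yard.parts.pop(0)
    depot = Model([Part("tram", 1.0, 0.0), digger])
    print(depot.to_the_left_of("digger", "tram"))    # True: digger is at x = 0.0

The lecture goes on to claim that mental models occupy exactly this quadrant.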

But the components of this system interact in a meaningful way. That’s what makes it fun for kids to play with. These are models, and I want to argue that models constitute a really important class of representation in cognition, by virtue of having this unusual combination of being isomorphic and componential. That’s an important, powerful combination. There is a range of different sorts of models that can come into play, and it’s helpful to distinguish those different sorts. To lay this out, I want to turn to a taxonomy that Gabriel Radvansky and I (Radvansky and Zacks, 2014) proposed in a recent book. First of all, I want to make clear that we are talking about mental models here, as opposed to models out in the world. The big loader construction set is an example of an actual model in the world. Here we are talking about mental models, mental representations. There are two main sorts that are important for cognition: system models and event models. System models are models of classes of experiences. I could have a system model that describes how a piece of equipment like a toilet or a fan functions and how the parts interrelate; or I could have a system model that describes how the components of a type of experience typically relate to each other; I could have a system model for going to a restaurant or for playing a game of cricket. And then event models are models of particular experiences that one has had in one’s life. Within system models, we can distinguish between physical system models, models of things like toilets and fans, and abstract system models, models of things like how a game of Go or chess works or how one progresses to a university degree. Within the domain of event models, it’s helpful to distinguish between event models that we construct from language, which—following Kintsch and van Dijk (1983)—are called situation models, and event models that we construct from direct experience, which we call experience models. So, event models are models of particular experiences, and system models are models of classes of experiences. Let me say a bit more about the role of system models in cognition. There is a range of different sorts of system models, but the ones I think are most relevant for the topic of event comprehension are scripts and schemas for events. A script or an event schema is a representation of what you know about how a typical experience goes. So if you go back to the experience of visiting many restaurants over your lifetime, there is some accumulated knowledge that integrates over those experiences and allows you to form expectations about what is going to happen, who is going to be there, and what objects will be present and in what organization. That kind of representation
can guide your attention, it can support a series of inferences, and it can serve as a retrieval guide when you are remembering things later. These kinds of representations have been given different labels over the history of cognitive science. The notion of a script was proposed by Roger Schank and Robert Abelson (1977), particularly to characterize our system models for experiences that are conventional and social in nature—things like going to a store, having a birthday party, or going to a baseball game or a football game. The event schema is a broader notion, originally proposed by Sir Frederic Bartlett (1932) and characterized by David Rumelhart (1980) and other people in the seventies. The notion of an event schema relaxes the social-conventional component that was important in the development of scripts. Another kind of system model that’s relevant for understanding events is the permission schema, proposed by Leda Cosmides and John Tooby (1994). A permission schema is the kind of thing that allows you to anticipate what is socially acceptable or not socially acceptable in a particular situation. In their work, they highlighted cases where people deviate systematically from normative principles of reasoning, because rather than using formal logic, their reasoning depends on permission schemas. To illustrate one of the kinds of effects a system model for an event can have, I want to consider this passage. Now, first, just have a quick look at the beginning and raise your hand if this passage is familiar. Ok, good, I think we are safe. Let me read it out. It says: The procedure is actually quite simple. First you arrange things into different groups depending on their makeup. Of course, one pile may be sufficient depending on how much there is to do. If you have to go somewhere else due to lack of facilities that is the next step, otherwise you are pretty well set. It is important not to overdo any particular endeavor. That is, it is better to do too few things at once than too many. In the short run this may not seem important, but complications from doing too many can easily arise. A mistake can be expensive as well. The manipulation of the appropriate mechanisms should be self-explanatory, and we need not dwell on it here. At first the whole procedure will seem complicated. Soon, however, it will become just another facet of life. It is difficult to foresee any end to the necessity for this task in the immediate future, but then one never can tell.

But if I first read this [[shows text “the paragraph you will hear is about washing clothes”]] before reading the first sentence, that might transform my understanding of the passage. In this study, John Bransford and Marcia Johnson (1972) did exactly this, and then they asked people to recall the passage after varying delays. After a relatively short delay, what they found is that the amount of information that people could recall, if they had not been given the topic, was embarrassingly low. Moreover, in this case, giving the information about the topic after the fact did not improve their recall; if that information was not in place to guide their encoding of the story, it did not help. But if they got this [[“The paragraph you will hear is about washing clothes”]] before, they recalled almost twice as much information. So what this illustrates is that the system model is guiding their comprehension, presumably by directing their attention to the features that are relevant, allowing them to draw inferences about what will be mentioned and what their roles will be, and facilitating the construction of an event-specific representation that allows the information to be stored in a way that is more durable. Here is another example of a system model acting in the context of comprehension and memory. In this study by Gordon Bower and his colleagues (1979), people read a series of stories that were generated from scripts. So here is an example of three stories that they generated from a script for going to a health practitioner. They started by first asking a number of undergraduates to list what typically happens when you go to a health practitioner. They then used those listed items to develop norms for the typical steps in going to a health practitioner, and then they constructed a number of stories from those norms. So I’ll just read the first one. John was feeling bad today so he decided to go see the family doctor. He checked in with the doctor’s receptionist, and then looked through several medical magazines that were on the table by his chair. Finally the nurse came and asked him to take off his clothes. The doctor was very nice to him. He eventually prescribed some pills for John. Then John left the doctor’s office and headed home. You might read that story, and then you might read a story about going to the dentist. Some of the elements of the story, you’ll notice, come from the script. So “John was feeling bad today. Bill had a bad toothache”. In both cases, these come from a step that the previous subjects had generated, saying that there is an initial symptom. Then there are other things that come from the script that are present in some of the stories but not others. For example, here it says “he eventually
prescribed some pills for John”, so there is a description of the treatment being administered. But in the dentist story, there is nothing about the treatment. Some of the steps were included in some of the stories but not others. And there was other information included that didn’t come from the script. For example, “He wondered what the dentist was doing”. That doesn’t come from the script. Or “the doctor was very nice to him”. That doesn’t come from the script. So each story had some events in it that came from the script and some that didn’t. Each person read 18 stories. They might read 1–3 stories from the health practitioner script, and then 1–3 stories from a restaurant script, and 1–3 stories from a party script. Multiple scripts, multiple stories, but we are going to vary the number of stories that come from each script. And then afterwards, they were given a recognition test in which they were presented with sentences that they had read and other sentences that they had not read, and they were asked to say which sentences were in the story. Some of the sentences that they had not read included things from the script that had not been stated in the story. For example, if I read a sentence that said the dentist extracted his tooth, that wasn’t actually in the story, but it is in the script. And then there were some of these other actions: if I read “The doctor was very nice to him”, that is not in the script, and for some of the subjects, that sentence might not have been in the story either. Here’s what we find. For sentences that were in the script and were actually in the story, people are very confident that those sentences actually were in the story. But they are also quite confident about sentences which were never presented in the story if those sentences fit the script. If you present me with something like “the dentist extracted the tooth”, I am quite likely to falsely recognize it even though it wasn’t in the story—especially if I read two or three stories generated from that health practitioner script, and a little bit less likely to do so if I read only one. Compare that to sentences that are perfectly plausible but are not in the script: people are much less likely to falsely recognize those as having been presented. You might think, OK, if I am looking at these sentences in front of me, I might make some inferences about whether a sentence could have been in the story; but if you just ask me to tell the story in a free recall situation, I’d be very unlikely to falsely recall this kind of sentence. But in fact, we see this in recall as well: if people are just asked to tell the story after a delay, they do pretty well recalling the sentences that were actually presented, but you can see that they are much more likely to falsely recall sentences that are plausible in the script but weren’t actually presented. Now, this has to be coded at the level of the gist, because I’m not going to use a particular
wording. But they are quite likely to produce those, compared to other actions that weren’t mentioned at all. And again, they are more likely to do so if they saw multiple stories from the same script. Both this recognition result and the recall result indicate system models acting as a bias on people’s memory. When I go to think about a story that I read, part of what I use is my knowledge about how stories about that kind of activity usually go. And that exerts a bias on my memory. And this is the key cell of that finding. So, those are some ways that system models can influence our conception during language comprehension. Event models also play a major role in our conception during language comprehension. I think this will be a good place to take a short break, and then we will come back and talk about those event models. We talked about the role of system models in comprehension and memory for language. I want to say just a little bit about the role of event models. I should say that both of these are topics that we are going to turn to in much greater detail later, but I want to make these initial remarks here with respect to the structure of these representations that I described in the very first part of the lecture. Many of you will no doubt be familiar with the description from Teun van Dijk and Walter Kintsch (1983) of three levels of representation in language. The surface structure is the exact wording and syntax of an utterance. The propositional textbase is the propositional content of those utterances. And at the point that [van Dijk and Kintsch] were coming along, discourse comprehension mostly treated this as the whole deal. And they said, “Look, if you want to account for how people comprehend and understand language, you really need to propose that, above and beyond these kinds of things, there is a representation of the situation that the text is about, one that is divorced from the particular surface form of the utterance and also from the particular propositions that are asserted by that utterance, and that really corresponds to the underlying situation that’s being represented.” In other words, it’s a model of the situation. That’s why they use the term model. Just to be clear about the terminology: I will use event models and situation models interchangeably when talking about models derived from a discourse. We distinguish between situation models and experience models, which are models derived from an actual experience. Both of them are kinds of event models. In this experiment by Kintsch and his colleagues (1990), participants read sentences like “Nick decided to go to the movies”, “He looked at the newspaper”—little minimal pairs where you have two sentences that paint a little vignette of a situation. And they were tested with original sentences to
which they should say “Yes, I read that” and altered ones to which they were instructed to say “No, I didn’t read that”. Some of the altered ones were paraphrases, such as “Nick studied the newspaper” rather than “He looked at the newspaper”. Oops, I said one word in the surface structure, but it’s actually two words as I’ve reproduced it here; two and a half: “He looked at the newspaper” versus “Nick studied the newspaper”. So, a change of a couple of words in the surface structure, and a change of one proposition in the text base. They were also given sentences that were inferences, such as “Nick wanted to see a film”. This is not in the text base, but it’s likely to be something that you infer from this pair. They were also given sentences that were not derived from the original at all but were just new, inappropriate sentences. So “Nick went swimming” is not appropriate in the context of that pair. It’s unlikely to be in the event model or in any of your other representations. They also included sentences which were consistent with, but not inferable from, the originals. I will not say much about those. This is the probability that people said yes to those different kinds of sentences after varying delays. What you can see is that for delays out to four days, people are pretty good at recognizing the sentences that they actually read. There’s about an 80% probability that they say “Yes, I read that”. And they are pretty good at rejecting sentences like “Nick went swimming” that are totally unrelated to the context and inappropriate for that pair. At the shortest delays, they are also pretty good at rejecting the inferences, such as “Nick wanted to see a movie” or “Nick wanted to see a film”, and the paraphrase sentences, such as “Nick studied the newspaper”. But if you wait just a little more than half an hour, they have a lot of trouble rejecting those paraphrase sentences. They are quite confident that they saw them. And they are even more confident that they saw the inference sentences. What Kintsch and his colleagues concluded is that the surface information and even the text base quickly fade from memory. What remains is the model of the situation that was described by the text. So these kinds of considerations invite us to ask what the internal structure of event models is. What Gabriel Radvansky and I (2014) proposed is that some of the key elements of the internal structure of event models include a spatiotemporal framework (a location or spatial setting, and an interval of time during which the happenings unfold) and a set of entities and relations among those entities. You can see that this is closely connected with the situation semantics of Barwise and Perry (1983) that we talked about this morning. I want to emphasize that this is an aspect of event representations that is domain-specific. We have representations in our minds for a lot of different kinds of things: objects, people, and social organizations, and they don’t need
a spatiotemporal framework and a set of entities standing in relations. This is about the content of events in particular. I want to focus most of our remarks today on the role of the spatiotemporal framework. This comes from an experiment from Radvansky’s lab (2017). In this experiment, people read a bunch of sentences that place objects in locations. One subject might have read a set of sentences that would include these six: The welcome mat is in the laundromat. The pay phone is in the laundromat. The oak counter is in the laundromat. The potted palm is in the hotel. The potted palm is in the museum. The potted palm is in the barber shop. Of course, this group falls into two sets of three: in these three the word laundromat repeats, and in these three the noun phrase potted palm repeats. But note that in these three the noun that repeats is the setting—they share a common spatiotemporal framework—whereas these three locate the same object, but in three different spatiotemporal frameworks. In terms of their complexity as propositional representations, these things are equivalent. But in terms of their event model structure, they ought to be quite different. If people are using an event model to comprehend and encode these, one can encode a common situation model that includes all three of these assertions, but there is no common situation model that encodes all three of these. The number of different items in which a noun phrase occurs in this kind of paradigm is called the “fan”. This is a term coined by John Anderson (1974), and it’s the number of connections between one of the entities, the potted palm or the laundromat, and the things with which it’s paired. If you only read one of these sentences, then the fan would be one; if you read two, the fan would be two. In both of these examples, the fan is three. What Anderson discovered is that the time to retrieve an arbitrary association depends on the fan: the more pairings that you encode, the slower it is to retrieve any particular pairing. However, Radvansky and his colleagues proposed that if multiple items can be encoded in a common spatiotemporal framework—encoded in a common situation model—and if people are relying on the situation model to retrieve them, then fan shouldn’t have an effect. To test this, they tested recognition memory for the sentences, and they did this over delays ranging from immediate to two weeks. What we are looking at here is the proportion of errors that people make. You see exactly the same pattern in the length of time it takes to respond correctly. What you find is that if the potted palm is encoded in multiple locations,
people are more likely to make errors the more locations it’s encoded in. And the overall rate of errors increases, and in fact the effect of fan increases quite dramatically, if you test at a longer delay. Whereas if you encode three items in the laundromat, there is no effect of fan immediately, and if anything only a very minimal effect of fan after a two-week delay. What this suggests is that the spatiotemporal framework is organizing people’s comprehension in a way that facilitates robust memory after even quite long delays. Now, within that spatiotemporal framework, there may be a privileged point of view of the observer. Certainly for lived experiences, we experience things from some particular location, and even in narrative, there is in general a point of view from which things are described. So consider a passage like this one, from an experiment by Nancy Franklin and her colleagues (1992): “You and Ted are both anxious to continue exploring the barn; Ted turns to face the hammock; You turn to face the pitchfork; Both the pointed end and the rounded tip of the handle stretch beyond the crates on which the tool lies; One of the farmhands seems to be a bit lazy about returning things to their proper places.” There are clearly two points of view in this passage: the point of view of the reader and the point of view of Ted. What they did in this experiment was to ask people to read this kind of passage and then to answer questions about the locations of the objects either from their own point of view, that second-person, reader point of view, or from Ted’s point of view. What they found is that people are much faster to answer questions about the locations when those questions are asked from their own point of view. So in some cases, when we are reading about events in a spatiotemporal framework, we are representing the privileged point of view of the experiencer in that framework. OK, now I want to turn to the depictive nature of event models in language comprehension. Much of this view has been articulated with wonderful clarity by Lawrence Barsalou (2008). The hypothesis that he proposes is that event models, and in particular situation models constructed from language, are implemented in part by isomorphic representational maps, the kind of depictive representations in the brain that I described this morning, knitted together into componential structures (Barsalou, 2008). So this is the marriage of isomorphism and componentiality that I described. And this entails some consequences: if this is right, then ongoing comprehension ought to show signs of modality-specific perceptual and
motor representation. That’s to say, as we’re reading, we ought to see the footprints of these depictive representations on our cognition. Now, there has been a lot of interest in the role of depictive representations in language comprehension: in behavior, in work with clinical patients who have language comprehension deficits, and in neurophysiology. Today I’m going to describe a few results using a physiological approach. In the fourth lecture, I will describe a few more results from behavioral and imaging approaches. To build an interpretation of this, I just need to make sure that I say a little bit about the nature of the measures. As many of you no doubt know, this is a picture of a magnetic resonance imaging machine. The data I am going to show you come from a particular kind of magnetic resonance imaging that is aimed at looking at the functional properties of the brain. I don’t doubt that this is a review for most people, but I want to highlight a few things about functional MRI [fMRI]. A functional MRI scanner is a big magnet. Embedded within it is a set of smaller magnets, which can produce rapidly changing magnetic gradients. Some key things about this technique: it is not invasive, so we can ask healthy observers to complete cognitive tasks in the scanner without administering any invasive procedures. It measures the local concentration of deoxyhemoglobin. In structural MRI, which is the kind most of us encounter as patients in a hospital, we use the imager to measure the local concentration of water, which tells us about the presence of things like bone and other tissues. In fMRI, we tune the machine to tell us about the local concentration of deoxyhemoglobin. It turns out that if you use a part of your body, like, say, performing arm curls, your body routes extra oxygenated blood to that area, and we see an increase in the local concentration of oxygenated blood there. This turns out to be true on a pretty fine scale within the brain. So that is the signal we are looking at, and the fact that it is a blood signal is important, because the blood response operates on a much slower time scale than the neural response. We can sample about every second with current technology, getting imaging elements about three millimeters in size. But we are limited not so much by the imaging technology as by the spatial scale and the temporal scale of the blood response that is involved. With that in mind, we can see responses that happen on the scale of a few seconds over relatively small regions of the brain. In the studies I will describe, we used this technique to look at the responses in the brain while people read long, extended narratives. I want to just digress a little bit to tell you about the narratives that we use, because I am going to
return to these over the next several lectures, and they are just really fun. There was a psychologist named Roger Barker, and he got a large grant from the US National Science Foundation in the 1940s to establish a field research station in a small town in Kansas. Barker’s idea was that we know very little about the ecology of human behavior. We spend a lot of time in the laboratory exploring the function of behavior. But it’s kind of like this: in biology, I can go to one of my colleagues’ offices and ask them about the physiology of the frog’s breathing system, and they can tell me the experimental results that characterize how the frog breathes. Then I can go down the hall and ask another biologist where the frogs live, what times of day they come out, and what times of day they are active or sleeping. And my colleagues can tell me that stuff. In psychology, we do a lot of the equivalent of studying how breathing works, but we never ask where the frogs live, right? We don’t ask what kinds of behaviors occur, which kinds of people engage in which kinds of behaviors, at what times of day, and in what physical settings. Barker and his colleagues (1966) set out to study this, so they embedded themselves in this small town and they just observed lots and lots of behavior and recorded it in many different ways—which brings me to these stories. One of the kinds of data that they collected was what they called day records—detailed records of the behavior of a person over the course of a day. One of them was published as this book, One Boy’s Day, and that’s exactly what it is. They had a team of 12 observers follow a 5-year-old boy around the small town, in rotating shifts of three, from the moment he woke up in the morning to the moment he went to bed that night. And one of the fun things about doing this with kids is that they don’t care; they just keep behaving the way they normally behave, whereas adults change their behaviors a lot if they know they are being watched by a team of 12. This is a 220-page book. They just describe everything this kid did over the course of a day. It’s kind of like James Joyce’s famous novel Ulysses: it’s just really boring. Nothing happens. It starts out something like this: … [Mrs. Birch] went through the front door into the kitchen. Mr. Birch came in and, after a friendly greeting, chatted with her for a minute or so. Mrs. Birch needed to awaken Raymond. Mrs. Birch stepped into Raymond’s bedroom, pulled a light cord hanging from the center of the room, and turned to the bed. Mrs. Birch said with pleasant casualness, “Raymond, wake up”. And it just goes on like that for 220 pages. This is great from our point of view, because as someone who is interested in narrative language and also in events
that are not narrated but are experienced visually and directly in real life, I find this kind of narrative is closer than usual to our real experience. It doesn’t jump around in terms of time or location, it’s very naturalistic, and it is perfectly good prose. For 10 or 15 minutes at a time, it’s actually entertaining and easy and fun to read. So our participants are very happy to read these narratives. And they are nothing if not descriptions of events. So in this study, the data I’m going to describe come from people reading about Raymond, reading excerpts from this book: four long excerpts, about 15 minutes in total, while lying in the scanner. They read them presented one word at a time on the screen, and we can time-lock the functional MRI [fMRI] response to the onset of the words. Then, for this study, we had another group of undergraduates read each clause in the narratives to code how high in imagery content it was. I will tell you about two data sets. One of the data sets had four episodes about 15 minutes long; the other had shorter episodes, paragraph length. Half of the paragraphs that they read were intact episodes, and the other half were scrambled so that each sentence came from a different episode. The readers knew before each paragraph began whether it would be an intact or a scrambled paragraph. What I have done here is to take a few of these sentences and just highlight the ones that were rated as having high-imagery content by our undergraduate coders and then code the modality of the imagery that each clause represented. The coding of the modality was done by two of the experimenters, and they had high levels of agreement. So it turns out that once you’ve identified the high-imagery clauses, it’s pretty easy to pick off whether they are auditory, motor, or visual. “Susan stood leaning against a nearby tree” is a high-imagery clause; that’s visual. “She ambled over behind them” clearly has a lot of motor content. “They sighed” is clearly auditory. “Took great gulps of air” is going to be motor. “Susan suddenly grabbed Raymond’s cap”, that’s motor, “and ran off”, that’s motor. “Raymond Pechter excitedly called, “Catch her, Catch her!””, that’s auditory. So once you’ve identified the high-imagery sentences, it’s easy to code the modality, and if you do this for the four long stories, you get this kind of summary of the distribution of high-imagery clauses and what their modality is. What you can see is that some of them have more of this high-imagery content than others, and they all have a smattering of each of these three kinds of content. Then, we can time-lock brain activity to the occurrence of these high-imagery clauses. So here I am showing you lateral views of the left hemisphere and right hemisphere and then medial views of the left hemisphere and right hemisphere. In that first dataset, these areas were selectively activated by
auditory clauses and by motor clauses. Earlier, when I showed you primary auditory cortex, Heschl’s gyrus, it was about here. You may recognize that these are the classical regions that are associated with speech comprehension and production. These are auditory-specific processing regions; if you have a lesion to this area, you will be aphasic. These are the right hemisphere homologs. This corresponds to the somatosensory representation of the hand. This is an area of premotor cortex. Importantly, both of these are localized to the left hemisphere, and the participants are right-handed. The motor system is mapped contralaterally, so that when we move our right hands, we selectively depend on the left hemisphere. So it’s highly significant that this happens just in the left hemisphere and not in the right hemisphere. You get areas that are known to be functionally important for movements of the right hand activated when people read motor imagery, and you get areas that are important for auditory language comprehension activated when people read auditory imagery clauses. So these results are consistent with modality-specific processing. Here are the data from the second dataset. We replicate very clearly the same pattern as in the first dataset. We also see significant activation for the visual high-imagery clauses. Not so much in classical visual areas, actually, but mostly in extrastriate visual areas. Importantly, in the second dataset, all of these effects are obliterated if people read the scrambled paragraphs. You only get this result if readers are able to construct a coherent situation model, and with the scrambled paragraphs they know that the stimulus won’t allow them to construct a coherent situation model. That’s one piece of evidence that, during ongoing comprehension, people are selectively activating modality-specific representations. Let me show you another analysis that provides a converging way of looking at the data. Here we’re not looking so much at the sensory modality; rather, we are looking at the content in a way that allows us to use previous results from actual perceptual and motor experiments to make judgments about people’s comprehension. I want to introduce you, if you are not already familiar with them, to a couple of brain areas that have been found to have highly content-selective response properties. The parahippocampal gyrus is an area here on the ventral surface of the brain, adjacent to the hippocampus, that responds selectively when people look at pictures of spatial environments, or when they think about spatial environments. People with lesions to this area have topographic amnesia, which means that they have grave difficulty learning new environments; they get lost very easily. A part of the parahippocampal gyrus was characterized by Russell Epstein and Nancy Kanwisher (2000) in experiments where they compared brain activity as people looked at pictures of locations to brain activity when people looked
at anything else: pictures of faces, pictures of objects, pictures of scrambled faces and objects. This is a coronal slice taken roughly through the middle of the parahippocampal gyrus. You can see that, bilaterally, this area responds specifically to spatial locations. This is another study, with a very different paradigm, conducted by Amy Shelton and John Gabrieli (2002), in which they had people play a little video game in which they had to navigate through a maze. They did this either from a first-person perspective or by viewing it from overhead, as on a map. These areas were selectively activated when people were engaged in the spatial transformations from the first-person perspective. These data are visualized in an axial slice going this way. So these are the same regions, here in the coronal slice and here in the axial slice, in two very different paradigms. So that’s the parahippocampal gyrus. Now I want to say more about the regions that were activated for the high motor imagery sentences. In a review, Umberto Castiello (2005) described a large series of experiments in which people executed motor actions in the scanner. Experimenters have built equipment so that people can lie in the scanner and reach out and perform various kinds of grasping movements while brain activity is recorded. When people execute grasping movements in the scanner, you see activity in the premotor cortex [PMC] and somatosensory cortex [SC]. And as I said before, if you do this with your right hand, you will see selective activity in the left hemisphere. So we can ask, as people are reading things like “Mrs. Birch walked into the bedroom”, which parts of the brain are selectively active? These are the brain areas that are selectively activated as people are reading “Mrs. Birch walked into the bedroom”, and you can see that this corresponds really wonderfully to the areas that we saw activated when people are looking at pictures of space or navigating in VR. So it’s the same areas of the parahippocampal gyrus. In a separate study in which people watched a narrative film, if we time-lock brain activity to visually experiencing walking into a new room, you see exactly the same areas. So these are different subjects and a completely different stimulus, and you see the same selectivity. I think this is pretty decent evidence that as people are reading for comprehension, they are comprehending by virtue of representations that are the same ones they use to process visual experience. For grasping, we can ask what happens when you read “Mrs. Birch reached for the lamp cord” and compare that to the visual observation of characters reaching for things in a film. What you see is that the premotor cortex and somatosensory cortex are selectively activated for both of those stimuli at those points, and critically, for both of them, this is selective to the left hemisphere, which
is consistent with the idea that our right-handed participants are using the representations that they would use to plan an actual movement. You see other things that are activated for both reading and for movie viewing, but those responses are not left-lateralized to the same degree. At least in some cases, then, event models from language appear to be depictive. In this regard, they converge closely with event models from visual experience. I want to conclude by emphasizing that I think this feature of event model structure—that it is both depictive and at the same time componential—underlies a lot of the power of the componentiality of language, and also of our episodic memory functions and our ability to plan for the future and to remember the past. I will stop there and we will pick up tomorrow with an examination of how experience is segmented into meaningful events as we perceive it. I will start with a theoretical exposition and show you a lot of data. But I think we can stop here and take some questions.
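To make that time-locking analysis concrete, here is a minimal sketch of event-locked averaging for a region-of-interest time series, assuming we have the ROI signal sampled once per TR and a list of event onset times (for example, moments when a character walks into a new room). The function name, the window lengths, the baseline correction, and the use of simple averaging rather than a full deconvolution model are illustrative assumptions, not the published analysis.

```python
import numpy as np

def event_locked_average(roi_signal, event_onsets_s, tr=2.0, pre_s=4.0, post_s=16.0):
    """Average an fMRI ROI time series around event onsets.

    roi_signal: 1-D array, one value per volume (TR).
    event_onsets_s: event onset times in seconds (e.g., entries into new rooms).
    Returns (time axis in seconds, mean response across events).
    """
    pre, post = int(pre_s / tr), int(post_s / tr)
    segments = []
    for onset in event_onsets_s:
        center = int(round(onset / tr))
        if center - pre >= 0 and center + post < len(roi_signal):
            seg = roi_signal[center - pre: center + post]
            segments.append(seg - seg[:pre].mean())  # baseline-correct on the pre-event window
        # onsets too close to the edges of the scan are simply skipped
    times = np.arange(-pre, post) * tr
    return times, np.mean(segments, axis=0)

# Hypothetical usage with made-up data: 300 volumes, TR = 2 s, three room entries.
rng = np.random.default_rng(1)
signal = rng.normal(size=300)
t, avg = event_locked_average(signal, event_onsets_s=[120.0, 260.0, 400.0], tr=2.0)
print(t[:3], avg[:3])
```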

Lecture 3

Event Segmentation Theory and the Segmentation of Visual Events

All original audio-recordings and other supplementary material, such as any hand-outs and powerpoint presentations for the lecture series, have been made available online and are referenced via unique DOI numbers on the website www.figshare.com. They may be accessed via this QR code and the following dynamic link: https://doi.org/10.6084/m9.figshare.8982434.

© Jeffrey M. Zacks, Reproduced with kind permission from the author by koninklijke brill nv, leiden, 2020 | doi:10.1163/9789004395169_004

To begin, I'd like to turn back to the philosopher William James, and particularly to the passage in the Principles of Psychology (1890) in which he coined the phrase "stream of consciousness". This is a quotation that is often reproduced, in which he writes: "A 'river' or a 'stream' are the metaphors by which [consciousness] is most naturally described. In talking of it hereafter, let us call it the stream of thought, of consciousness, or of subjective life". But when this paragraph is reproduced, they always end the quote right here. He immediately goes on, however, to contradict himself, saying "But now there appears … a kind of jointing and separateness among the parts, of which this statement seems to take no account. I refer to the breaks that are produced by sudden contrasts in the quality of the successive segments of the stream of thought…." He later goes on to compare those to joints in bamboo. He says "The transition between the thought of one object and the thought of another is no more a break in the thought than a joint in a bamboo is a break in the wood. It is a part of the consciousness as much as the joint is a part of the bamboo". It is these joints that I want to focus on today. I want to make an argument today that the jointing of the stream of consciousness, the parsing of the ongoing stream of behavior into a sequence of meaningful events, is a fundamental cognitive operation that shapes our perception, our action planning, our memory and our language. I am going to present a theory that tries to account for how this segmentation occurs and I will show you some data that tests the theory in the context of visual narrative events. To motivate this a little bit more, I'd like to just demonstrate a task that we often use in the lab. This is a task that was originally developed by the social psychologist Darren Newtson. What he asked people to do was to press a button
(in this case I will ask you to clap) whenever, in your judgment, one meaningful unit of activity ends and another begins. There is no right or wrong answer; I am simply interested in your judgments. In this case, I'd like to give you a further instruction to mark off the smallest units that you find to be natural and meaningful. I'm going to show you a short movie and I'd just like you to clap whenever you perceive that one event has ended and another has begun. Are you ready? Here we go. [Watch a video clip] Ok, I think you get the idea. People rapidly, with virtually no instruction (you had approximately zero trials of practice), can come to agree, not perfectly but with a high amount of regularity, on where the boundaries are between the events. Most people, when they read these instructions in my lab, look at me funny and say "well, what do you mean by a natural meaningful unit? How do I know if I am right or wrong?" We tell them there is no right or wrong answer. They look at us as if it's very strange. Then they—like you—quickly find that the task is intuitive. What you see when you sample large numbers of people is that there is strong intersubjective agreement about where the boundaries are. I hope you observed the synchrony in your clapping. This is a little bit larger dataset: these are data from two hundred observers. Each of these pink dots is a point at which one person pushed the button. This is the estimated density of how many (or what proportion) of people were pushing the button at a given moment. What you can see is that there are very strong peaks and valleys. There are places where virtually everybody identifies a boundary and places where basically nobody does. We see very robustly that viewers agree on where the event boundaries are. Viewers can also modify the grain at which they segment. I asked you to identify the smallest units that were meaningful to you. If I then asked you to watch it again and give me the next higher level, the next longer duration of unit that you thought was meaningful, you would find that pretty intuitive. You can probably go two or three or four levels higher. We often ask people either to give us the smallest units that are natural and meaningful, or the largest units that are natural and meaningful. And if we really want to get precise about a particular grain size, we can give people feedback. So if I ask you to segment for two or three minutes and then say "That was great, but could you maybe try it again and segment it a bit more finely", then people can rapidly converge on a consistent level so that we can compare across observers. This just gives you an illustration, from one recent study (Zacks et al. 2013), of the kinds of distributions of unit sizes that we observe. In this study, people segmented three different movies of everyday activities like the one I showed you.
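To make the density plot concrete, here is a minimal sketch of how one could compute a segmentation density from raw button presses, assuming we have each observer's press times and the movie duration. The one-second binning matches the binning described later in this lecture; the Gaussian smoothing and all function and variable names are illustrative choices, not the lab's actual code.

```python
import numpy as np

def segmentation_density(press_times, movie_duration, bin_width=1.0, smooth_sd=2.0):
    """Proportion of observers who segmented in each time bin, lightly smoothed.

    press_times: list of arrays, one per observer, of button-press times in seconds.
    movie_duration: length of the movie in seconds.
    """
    edges = np.arange(0.0, movie_duration + bin_width, bin_width)
    n_bins = len(edges) - 1
    binned = np.zeros((len(press_times), n_bins))
    for i, times in enumerate(press_times):
        counts, _ = np.histogram(times, bins=edges)
        binned[i] = counts > 0                 # did this observer segment in this bin?
    density = binned.mean(axis=0)              # proportion of observers per bin
    if smooth_sd > 0:                          # optional Gaussian smoothing
        half = int(3 * smooth_sd)
        kernel = np.exp(-0.5 * (np.arange(-half, half + 1) / smooth_sd) ** 2)
        kernel /= kernel.sum()
        density = np.convolve(density, kernel, mode="same")
    return density

# Hypothetical usage: three observers watching a 60-second movie.
presses = [np.array([5.2, 21.8, 40.1]),
           np.array([5.9, 22.3, 41.0]),
           np.array([6.1, 39.5])]
print(segmentation_density(presses, movie_duration=60.0))
```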

On one occasion they were asked to identify the largest units that were meaningful to them. On another occasion they were asked to identify the smallest units that were meaningful to them. What you find is that the peak of the distribution for the fine units is somewhere around 8 or 10 seconds, with a mean of about 10 to 12. The coarse unit distribution is more spread out, but the units are about twice as long. And if you look at a given individual, what you will find is that virtually everybody in the sample is able to follow the instruction. So if you compare a given individual's fine units to that same individual's coarse units, the coarse units are always longer, as we instructed them to be. Occasionally, we see someone who just does not get it. But that's very rare. At both of these grains of segmentation, there is strong agreement about where the boundaries are. These are some older data where a group of observers segmented the same movie, a movie of making a bed, once to identify coarse-grained event boundaries and once to identify fine-grained boundaries. What you find is that for both the coarse and the fine, there are these strong peaks and valleys. At both coarse- and fine-grained segmentation, there is strong agreement across observers about where the boundaries are. We find this to be very robust across different populations. These are data from a group of adults and a group of children. In this case, they were watching a video from a Swiss claymation series called Pingu, which I can highly recommend. What you can see is that for both the adults and the children, there are strong peaks and valleys in where they see the event boundaries. Across the two groups, the peaks and valleys line up pretty well. We don't see a lot of systematic differences between groups in the location of these peaks and valleys. That turns out to be important. Within adults, if we compare younger adults to older adults (in this case, they were segmenting a movie of a man doing laundry), what we find is that there is strong agreement in both the older and the younger adults. Even in the group of older adults who have early stage Alzheimer's disease, we find the location of their peaks and valleys agrees strongly with that of the healthy older adults and of the younger adults. It could have been that there would be big group differences, such that people with different life experiences or at different stages of life would really segment the activity differently. That doesn't seem to be the case. You will have noticed that the degree of consistency seems a little bit higher in the younger adults and maybe lowest of all in the older adults with Alzheimer's disease. Some of these people are missing the consistently identified event boundaries. We will see that the degree to which they are able to hit the event boundaries that their cohort hits is important. Even in adults with brain injuries, we see strong agreement. This is a different video, in this case, a woman doing laundry. Here we are looking at the

probability of segmentation over time, at coarse grain and fine grain, for two groups of observers: controls and men who experienced traumatic brain injury. What you see is that again there is strong agreement as to where the boundaries are, and there's great agreement across the groups in where the peaks are. But as with the older adults with Alzheimer's disease, the brain injury group shows a little bit less agreement than the healthy group. Again, that turns out to be important. So, given that people seem to parse the ongoing stream of behavior in a commonly agreed-upon way, it's helpful to quantify the degree to which a given observer is seeing and segmenting that activity in the same way as other observers. A measure that we've developed in the lab to quantify this is what we call segmentation agreement. I will walk you through the algorithm for calculating segmentation agreement: You bring a bunch of people into the lab and ask them each to segment the activity at a given grain, fine or coarse or something else. Then we bin the data into one-second bins. So for each observer, we record whether they pressed the button during each second of the movie. Then we can construct the norm by adding up all those binned segmentation data across the observers. So we can identify the time points at which many people segment and the time points at which few people segment. Then for each individual, we will correlate their binned data with the summed norm. So each individual now is represented as a vector of zeros and ones, and the group is represented as a vector of the sums. We can calculate the correlation between the individual and the group. If the individual is segmenting at the places that are chosen by everybody else, they will have a high correlation. If they are segmenting at places that others don't identify as event boundaries, they will have a low correlation. The raw correlation has a truncated range that depends on the number of event boundaries that the individual identifies. For example, if I identify just one event boundary when I segment a movie, then the very best correlation I could achieve is something less than one. And if I pick the very worst point, the very worst correlation I could achieve is something greater than negative one. So the range of correlation is truncated. We don't want the range of the segmentation agreement measure to depend on the number of boundaries that a given individual identifies. So often we will rescale the data so that the very worst you could achieve given how many times you segmented is zero, and the very best you could achieve is one. So basically what we are going to do is correlate each individual's segmentation pattern with that of the group and ask how well they are picking out the same event boundaries that their fellow observers did. Another feature of segmentation that turns out to be important is hierarchical organization. If we compare a given observer's coarse-grained segmentation

to their fine-grained segmentation, what we often see is that the fine-grained units appear to cluster together into larger coarse-grained units. One way to quantify the degree of that clustering is to compare the temporal locations of the coarse-grained boundaries to the temporal locations of the fine-grained boundaries. If you consider one observer who identified a series of coarse-grained boundaries here and a series of fine-grained boundaries at these time points, we can look at each of the coarse-grained boundaries and measure the distance to the fine-grained boundary that was closest to it. In this case, it would be this one. This would be the distance. Then we can ask how big that distance is (how close these two things are in time), compared to what we would expect if the coarse-grained and the fine-grained boundaries were independent. To the extent that a person is perceiving the activity hierarchically, such that the coarse-grained boundaries are a subset of the fine-grained boundaries, these distances ought to be smaller than one would expect. So we can calculate an expectancy for what the observed distance would be if the two time series were independent, and then ask how much smaller the actual mean distance is than what we would expect. That's going to be our measure of how hierarchically organized their segmentation is. Virtually every group that we've looked at, every stimulus that we've looked at, every instruction that we've looked at produces both significant agreement across observers and significant hierarchical alignment. This is from a sample that I showed you earlier. This is the degree of segmentation agreement, measured as a correlation, as a function of the age of the observers. The observers in this study are 208 people who range in age from 20 years old to 80 years old. Zero is the correlation you would get if there were no agreement across observers. At every age range, the correlation is substantially and reliably above zero, for both fine segmentation and coarse segmentation. So basically, virtually everybody in the sample shows positive segmentation agreement. In the same sample, if we plot the degree of hierarchical alignment over age, what we see is that, with zero indicating no hierarchical alignment between coarse segmentation and fine segmentation, at every age point there's substantial hierarchical alignment. What it says is that at every age the fine units are about a second or more closer to the coarse units than they would be if the two time series were independent. So people are agreeing with each other across age, and they are organizing their segmentation hierarchically across age. To summarize this phenomenon, people can segment ongoing activity reliably with virtually no training. This seems to be picking up on something that

is just a natural part of observing the activity, such that we can ask people to push a button to mark it and they can do so with essentially no training. They can indicate event boundaries at a range of temporal grains. We've gone from on the order of a few seconds to on the order of minutes. That seems to present no problem. Across grains, segmentation is hierarchically organized. Observers show strong agreement in the boundaries they choose. Phenomenologically, if you ask people what they are experiencing when they do this kind of study, they tell you that when they do this task they are simply indicating for us something that's already going on in their heads. This is a ubiquitous concomitant of their everyday experience of the activity. These observations lead us to ask what the mechanisms are that lead one to segment ongoing activity, that lead one to perceive that one meaningful activity has ended and another has begun. Our current account of those mechanisms is the theory that we call event segmentation theory. What I would like to do next is to lay out the architecture of event segmentation theory. We start from the assumption that much of the mechanism of perception and comprehension is forward oriented. It's predictive. When we are perceiving language, we are making predictions about upcoming words, about prosody, about turn-taking among speakers. When we perceive everyday activity, we are making predictions about body motions, about changes in object contact, about goals, about causal relations. This is super adaptive, right? The general architecture of having a forward-oriented comprehension system that is making predictions about the future is widely conserved across the evolutionary spectrum. It's valuable, as an organism, to be able to approach things that are good for you and withdraw from things that are noxious. But it is way better to be able to anticipate the good things and the bad things, and guide your behavior before they are right in front of you. If you can see the lion coming out of the jungle and get out of the way in time, that's way better than trying to get loose once you've been jumped. So lots of animals, humans and others, have sensory and perceptual processing systems that are oriented toward the future, constantly making predictions about what's going to happen on the scale of maybe fractions of seconds to tens of minutes. We propose that in order to make effective predictions in the domain of complex naturalistic human activity, it's very helpful to have a model of what's happening now. These are the event models that we talked about in the last lecture. This is a representation that establishes a spatiotemporal framework, locates entities and their relations in that spatiotemporal framework, and allows you to keep track of the things that are persistent and ongoing in the face of the complex, high-bandwidth dynamic flux of our sensory

experience. So as we are sitting in a lecture hall, it's helpful to keep track of the fact that we are in a room with entities in particular locations. There is a speaker at the front even if you happen to be looking at the side. The speaker has legs, even if he happens to be standing behind a desk so that they are occluded. This allows you to predict the trajectory of things that are going to happen in the near future. The trick about maintaining a representation like this is that, on the one hand, it's got to be protected from the vicissitudes of the sensory input—you don't want your representation of the world around you to change every time you blink your eyes or get distracted or every time something gets hidden—but, on the other hand, you have to be able to update these representations from time to time, or they are going to go stale and lead to bad predictions. Eventually, this morning's lecture is going to end. If you are then still predicting that when you look over here there is going to be a speaker talking, that's going to be a bad prediction. It's not going to guide behavior very well. We have to navigate this tradeoff between, on the one hand, maintaining stable event representations that are protected from the complications of moment-to-moment perception and, on the other hand, updating when those models go out of date. So how can we do this? If we had just a little angel standing on our shoulders, poking us and saying "Ok, it's time to update now", that would be a great solution to this problem. If we lived in a world where, as children, we got training that said "Ok, this is an event boundary; you need to update", we could learn how to update from that feedback. That would be great. But clearly humans and certainly other animals don't live in that world where we get supervised instruction about how to segment activity. We learn to do this updating the same way that we learn to comprehend spoken language: by being immersed in it. And so we need a mechanism for learning how to segment activity that's unsupervised, that doesn't depend on an angel on our shoulder or a teacher telling us where the event boundaries are. What we've proposed is that you can build such a mechanism by monitoring your prediction error (the difference between what I predicted was going to happen and what actually happens), and then updating at those points at which there is a spike in prediction error. When my prediction error transiently increases, what I am going to do is gate open the inputs to my event models, update them, and allow them to settle into a new stable state. Most of the time these models are closed off from the early stages of perceptual processing. When the system detects a spike in prediction error, it updates the event models, opening them up to the early stages of perception and comprehension, and also to input from my long-term memory. This could be long-term memory in the form of

knowledge—schemas—which is a kind of system model, as described yesterday. And [[the system also is opened up to]] episodic memory, my representations of particular events that I experienced in the past. So if this system is functioning correctly, you can imagine how it evolves over time: most of the time I should have a good event model and my prediction errors are low. But then from time to time, things are going to change in the world and my prediction errors will spike, and at those points I will update my event representation. Hopefully, things will settle into a new stable state, prediction error will go down, and I will continue on. So if things are operating well, I will oscillate between these relatively extended periods of cluefulness, where my prediction error is low, and these hopefully relatively short and relatively infrequent periods of cluelessness, when my prediction error spikes and I have to update. An initial test of this architecture came from a simple computational model that Jeremy Reynolds, Todd Braver and I (2007) built specifically to test the idea that, if one lives in a world that really consists of events, where one coherent chunk of activity usually goes a particular way and then can be followed by any of several other coherent chunks of activity, then a model based on this architecture can take advantage of that structure and use it to update its event models effectively. An important idea here is that this computational architecture is effective because our world really is sequentially structured. You can predict what's likely to happen next in a given context—whether it's making breakfast or being in a lecture or washing clothes. But when washing clothes ends and the person who is doing the washing goes to make breakfast, across that boundary between events things are less predictable, and you are going to need to update your model at those points. This model lives in a world where it watches a human figure carry out a set of simple actions. The input representation to the model is the 3D position of 18 points on the body, acquired from a motion capture system. It sees a series of actions like chopping wood, eating with a fork, and sawing. Its job is just to look at those positions—so it's looking at 54 numbers, the 3D positions of 18 points on the body—and try to predict what the position of the body is going to be on the next time step. It just stares all day long at a very large training set that is built by concatenating a bunch of these movies of individual actions together. Within an action, things are either approximately or exactly deterministic, depending on the simulation. So within an action, the movement always goes basically the same way. But any action can follow any other action. This is designed to capture the idea that everyday experience consists of sequences where within an event unit things are

predictable, but events can follow multiple different events. The architecture of the model is a connectionist network, so each of these ellipses is indicating a layer of neuron-like processing units that project massively to the next layer and to the next layer. Models like this can learn relatively arbitrary functional associations between an input and an output. Again, I want to emphasize that the input-output relationship we are asking it to learn is this: given the input at time t, the output it is supposed to produce is what the input will be at time t+1. This is a one-step prediction task. It's just trying to learn what's going to happen on the very next time point. And if you train a model like this, over time it gets better and better and better. But the first hypothesis that is entailed by the theory is that it will do better for those time steps that are within an event than for those time steps that cross an event boundary. So, if I am trying to predict from the first time point of sawing wood to the second time point, or from the third time point to the fourth, that ought to be easier than trying to predict from the last time point of sawing wood to the first time point of whirling an ax or one of the other activities. We wanted to test the hypothesis that predictions across event boundaries would have higher error. That's necessary for this architecture to work, because we need there to be a reliable signal that discriminates event boundaries if we are going to update based on prediction error. So what this is showing is, after training that network, the distribution of errors when it's trying to predict within an event or across an event boundary. The good news is that when you're trying to predict across an event boundary, prediction error is in general higher. However, there is overlap, which means that if you were to update just by setting some criterion on prediction error and updating whenever your prediction error exceeds that criterion, then you wouldn't track the structure of the world perfectly, but you would do pretty well. And we think that's just right. We don't think this mechanism is perfect. We think it's pretty good, good enough to be adaptive, but sometimes you are going to make errors. You might have a time point with low prediction error where out in the world a new event really had begun, or you might have some time point with higher prediction error where out in the world a new event really had not begun. So then we made various alterations to the model to test the hypothesis that adding updated event representations would facilitate prediction. The first modification was to add a set of hidden units that function as the equivalent of the angel on the shoulder, telling the model "Ok, now he is sawing wood" or "Ok, now he is eating with a spoon". On every time point, these units provided the system with a representation of what the activity was. This asks the question, "If we had perfect knowledge of what the event type is on every time

point, if we had a perfect event model, would that allow us to do better prediction?" And then we built a version of the model in which it had to learn the event representations, but it was given information about event segmentation. It had a slightly less intelligent angel standing on its shoulder. This angel didn't know what was going on, but knew whenever something changed. So this angel could say "Ok, now you need to update your representation. I don't know what just happened, I don't know what's happening next. But I know that now is the time you need to update." And the question is, if you had a perfect signal as to where the event boundaries are, could you learn the event representations to guide your predictions? Then we built two other models. First, one in which it didn't have a perfect gate; it had to use prediction error to do gating. But whenever it did gate, it could basically ask that first angel "Ok, what's the event now?" When it did update, it got perfect labels. But if it failed to update when the event actually changed, it would be operating with a stale event label until it chose to update. And then the last one is what we call the Full Monty, the whole operation together: the model had to learn what its event representations should be, and it had to learn them by gating based on prediction error. It gets no help at all. It's like what we face in development. We can characterize each of these models in terms of how well they perform relative to the basic model that just has the three layers feeding one to the next. And that's what we are showing here. Baseline is this zero percent; that's how well the initial model does. And if we give explicit labels, saying what the events are on each time point, that helps the model a lot. If we don't give it explicit event labels, forcing it to learn the event representations, but gate the updating perfectly, then the model does just about as well. If we force it to update based on prediction error but give a perfect label when it does update, it does a little bit less well. And if we force it to learn everything on its own, it's still a bit better than just the straightforward model. So this model, with no experience other than just watching the world go by, can learn the event structure of this environment and use that event structure to improve its ability to predict what's going to happen next. In a sequentially structured world, prediction error indicates event boundaries. And that signal can be used to update event models adaptively, so that we can learn what the events are in the world. For the rest of the talk I am going to focus on empirical data that complement this computational model by providing laboratory tests of the theory. I think this will be a good time to take a short break, and then I'll continue on with the empirical data. What I want to do next is to turn from this relatively abstract computational theory to some concrete empirical demonstrations. The point of these empirical studies was to test a basic hypothesis that comes out of event

segmentation theory, which is that as we are observing the world, when more things are changing out there in the world we are more likely to have prediction errors, and we therefore are more likely to update our event models. So event segmentation theory entails that when more things are changing in the world, you are more likely to experience an event boundary. We and others have tested this in a bunch of domains. Today I am going to talk about nonlinguistic examples, examples from visually experienced events. And then this afternoon I will talk about the language case. So here is a set of studies looking at how changes in our environment affect visual narrative segmentation. One of the early ones involved simple two-object animations. These animations were made by asking two players to control a video game in which one person controls each of these objects. They were given instructions to engage in a little activity. So here [[shows animation]] the green square is chasing the orange circle. Our observers can watch this and can tell us, with pretty high fidelity, what activity is being performed. We also made a set of animations that were matched to the distance, motion, and acceleration properties of the human-generated activities but were randomly generated. This movie [[shows animation]] has the same average motion, same average acceleration, same average distance between the objects. But it was just generated by an algorithm. We can ask people to watch both of these things. People will guess about what the activity is. They are often pretty confident that it was human generated, but in fact it's random. The hypothesis we wanted to test is that when the features of the movement are changing rapidly, those points in time would tend to be experienced as event boundaries. The nice thing about these stimuli is that we can perfectly characterize anything we want about the motion of the objects, because we created it. We can calculate a large number of features describing how the objects are moving. These include the positions of the objects over time, their velocity, their acceleration, their speed, and the magnitude of their acceleration. Speed is just velocity without direction. Acceleration magnitude is just acceleration without direction. We can calculate their relative position, the distance between them, their relative speed, their relative acceleration. We can know when there are maxima or minima in the speed: when they are going maximally fast and start to slow down, or minimally fast and start to speed up. We can calculate when they are accelerating or decelerating maximally, when they are as far away as they are going to get and start getting closer together, or when they are close and start getting farther away. We can characterize whether they are accelerating away from each other or toward each other, and we can calculate maxima and minima of that. Features that turn

out consistently to be reliably predictive of segmentation include maxima in acceleration magnitude (when the things are speeding up the most), distance between the objects, relative acceleration (are they accelerating toward each other or away from each other), and maxima and minima in distance. I can illustrate that with this animation here [[shows animation]]. What we are going to see is one of those movies, and this is showing you the proportion of observers who segmented over time. Then here I've selected a few of those features: distance, relative speed, relative acceleration, the speed of one of the objects, the speed of the other, the acceleration of one and the acceleration of the other. What you can see here is that, as they are moving around, some points are perceived more as event boundaries. Here when they go apart, that's perceived by a lot of people as an event boundary. Now they are going to come together, and when they get maximally close together and start heading apart, virtually everybody sees that as an event boundary. So there is a strong relationship here between the distance between the objects and the probability of segmentation. There are also strong relationships—though not as easily discernible with the naked eye—between these other variables and segmentation. Overall, these variables account for a lot of when people segment. This holds for more complex stimuli. This is an experiment in which we filmed an actor doing a bunch of table-top activities. He was wearing sensors on his hands and head to tell us how quickly those parts of his body were moving. Then we can do the same kind of analysis. Here we are looking again at segmentation frequency, as a function of how fast the left hand is moving, the right hand, the head, how much the left hand is accelerating, how much the right is accelerating, how much the head is accelerating, the distance between the two hands, and the distance between each of the hands and the head. In this case he is paying a set of bills. I think he is going to put down the pen in a second. He puts down the pen, a bunch of those motion features change, and people experience that as an event boundary. For both very simple stimuli and for more complex naturalistic activities, when motion features are changing a lot, people tend to perceive event boundaries. You can see this quantitatively in this figure. Here, we are asking how much of the variability in where people segment we can account for with the motion of the body. These are data from three videos: in one he was folding laundry, in another he was building a model out of a construction toy, and in the third he was putting together a piece of furniture to hold a videogame console. For all of these, what you see is that there's a substantial amount of the variance in where people segment accounted for by the movement features. In general, it's greater for fine-grained event boundaries than for coarse-grained event

boundaries. But it is statistically reliable for both grains for all the movies. So, in short, in these movies, when more of the movement features of the activity are changing, people are more likely to perceive event boundaries. Now, this account is very broad in the kinds of features that are proposed to be relevant. It's not just about the low-level motion properties. Some of the other features that people presumably attend to when they watch everyday activity are the sorts of things that we routinely talk about in language: things like causes and characters and interactions amongst characters, interactions of characters with objects, changes in goals, changes in spatial location. We can code all of these kinds of changes in narrative film. For this study, the narrative film is a French film by Albert Lamorisse called The Red Balloon. It's easy to code this movie for intervals in which things like the objects the characters are interacting with are changing or the spatial location is changing. Here he goes from indoors to outdoors; that's a location change. Here he picks up this balloon for the first time; that's a change in object interactions. Then we can ask: Are people more likely to identify event boundaries at these changes than at points that have fewer changes? Here is what you find. If we count the number of feature changes in a five-second interval and plot the likelihood that someone will segment as a function of the number of feature changes, what we find is that for both fine-grained event boundaries and for coarse-grained event boundaries, there's a robust increase in the probability of segmentation as the number of changes increases. So, the more stuff is changing, the more likely people are to perceive that a new event has begun. You can take a similar approach with sequentially presented images. These are pictures from a children's picture book about a boy, a frog and a dog. It's by Mercer Mayer. In this book, the boy goes on a bunch of adventures with the frog and the dog. You can present these stories as both pictures and text or just as pictures. They are pretty comprehensible without any text at all. So we can code how many features are changing from frame to frame in the story. In this analysis, we coded it for changes in spatial location or time (we had to group them together because in this book they tend to correlate; usually it cuts from one location at one time to a later time at a different location), changes in characters, and changes in affective reactions. So, when the boy wakes up and discovers that his frog is missing, that's a change in affect—he becomes very upset. We also coded changes in the actors' goals: after the boy discovers the frog is missing, he takes up the goal of finding the frog. Again, what we find is that the more features are changing, the more likely we are to segment. Here they were grouped into zero to one feature changing, two features changing, or three features changing from panel to panel. This study

happened to include both younger adults and older adults. For both groups, you can see that the likelihood of perceiving an event boundary increased with the number of feature changes. Recchia, Hard and their colleagues (2011) gave us a converging method for looking at segmentation using sequentially presented pictures. Here is what they did: they took several movies—this is a movie of a woman making a bed, this is a movie of a woman doing something in an office, I can't remember what these two are—but four everyday activities. They took the movies and made them into slide shows by sampling the movie at one-second intervals. Then they let people experience the slide shows by pushing a button to advance to the next slide. So this is kind of like a self-paced rapid serial visual presentation reading paradigm, but instead of advancing from word to word to word, you are advancing from picture to picture to picture. The pictures are pulled out at one-second intervals. They are presented as a self-paced slide show, and we can record the viewing time for each picture. Then, after doing the self-paced viewing, each person segmented the slide show to indicate boundaries between meaningful events. In this study they were asked to segment three times, once at a fine grain, once at an intermediate grain and once at a coarse grain. The measure of change that they used in this study is very different from either of the two that I described before. Here they just calculated the difference in pixel intensity at each point in the picture from frame to frame. If you see a point here on her elbow that goes from dark to light, that would be a big change, and the more of those changes there are, the higher that variable is. If the actor is moving quickly, or if an object is moving quickly, there will be a lot of pixel change. If the actor is basically sitting still, then there will be little pixel change. So they asked how well this variable predicted their segmentation judgments and also their viewing times. This is an example of the viewing times for one person watching one of these slide shows. I put this up just to note a practical thing about using this method: When people are watching a slide show of everyday activity like this and pacing themselves, they tend to go quite slowly at first as they are learning about the environment of the story, and then they get faster and faster as they go. So as a matter of preprocessing these data, the first thing that Hard and colleagues do is fit an exponential function to the whole curve, to remove this big initial learning effect. Then they can look at these small ups and downs and ask "are people taking longer when there are more features changing?" Then they can also ask whether people are more likely to segment when more features are changing. Here is what you see. These are the first segmentation data. The way they presented them—I really like this way of presenting it—is that they time-lock the

data to the point at which the person said "Yes, this slide is an event boundary". And they ask how many features are changing at the event boundary, one slide before the event boundary, two slides before, three slides before, and then one slide after, two slides after and three slides after. What you can see is that for fine-grained, intermediate and coarse-grained segmentation, the peak of feature change is right at the event boundary. And there is an increase in feature change as you head up to the event boundary and a decrease as you go past it. As you approach an event boundary, there is more and more change happening. As you pass the event boundary, the amount of change decreases again. In other words, those points that I identify as event boundaries are spikes in feature change. Here are the data for how long people spent looking at the slides, again time-locked to the slide that they identified as an event boundary. What you see is that the viewing times are longest for those slides that are identified as event boundaries. Again, there's basically a rise up and then a fall-off. This point is a little bit unnormalized. You can see this both for fine-grained segmentation and for intermediate, and a little tiny bit for coarse-grained segmentation. So when the pixels in the images are changing rapidly, people perceive event boundaries and they slow down as they are comprehending. I want to say one thing about prediction error in the context of these visually presented narratives. According to event segmentation theory, the reason that events are segmented at points of high feature change is that feature changes lead to prediction errors. Before I leave off describing how people parse visually presented events, I'd be remiss if I didn't tell you about one test of that hypothesis. I am going to go into prediction error and its function in a lot more detail in Lecture Six. But I want to leave you with just one example of this. Again, event segmentation theory says that the reason people segment when a lot of features are changing is that those changes cause them to experience a spike in prediction error. One way that we've tried to test this is by just explicitly asking people to predict what's going to happen next in a movie. Observers in these experiments watch movies of everyday activities like this one. This [[shows movie]] is a movie of a woman washing her car. Then from time to time, we would pause the movie and show them two pictures. One of these pictures depicts what's going to happen in five seconds in the movie, and the other is selected from an alternative take where she was waxing her car. How many would vote that it's the picture on the left that shows what's going to happen? How many would predict it's the picture on the right that shows what's going to happen? It's the picture on the right. Well done.
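Here is a minimal sketch of this style of slide-show analysis, assuming we have per-slide viewing times, a binary vector marking the slides a viewer called boundaries, and grayscale frames for the pixel-change measure. The crude grid-search exponential detrend, the three-slide window, and all names are illustrative choices, not the authors' actual pipeline.

```python
import numpy as np

def pixel_change(frames):
    """Mean absolute pixel-intensity difference between successive slides.

    frames: array of shape (n_slides, height, width), grayscale intensities.
    Returns an array of length n_slides (first entry is 0 by convention).
    """
    diffs = np.abs(np.diff(frames.astype(float), axis=0)).mean(axis=(1, 2))
    return np.concatenate([[0.0], diffs])

def detrend_viewing_times(times):
    """Remove the slow initial speed-up by fitting a + b*exp(-c*i) with a crude grid search over c."""
    i = np.arange(len(times), dtype=float)
    best = None
    for c in np.linspace(0.001, 0.5, 200):
        X = np.column_stack([np.ones_like(i), np.exp(-c * i)])
        coef, *_ = np.linalg.lstsq(X, times, rcond=None)
        resid = times - X @ coef
        sse = (resid ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, resid)
    return best[1]                      # residual viewing times after detrending

def boundary_locked_average(signal, boundaries, max_offset=3):
    """Average `signal` at offsets -max_offset..+max_offset around boundary slides."""
    idx = np.flatnonzero(boundaries)
    out = {}
    for off in range(-max_offset, max_offset + 1):
        shifted = idx + off
        valid = shifted[(shifted >= 0) & (shifted < len(signal))]
        out[off] = float(signal[valid].mean()) if len(valid) else np.nan
    return out

# Hypothetical usage with random stand-in data: 120 slides, three boundary slides.
rng = np.random.default_rng(0)
viewing = 2.5 * np.exp(-0.05 * np.arange(120)) + rng.normal(1.0, 0.1, 120)
bounds = np.zeros(120, dtype=bool)
bounds[[20, 47, 90]] = True
print(boundary_locked_average(detrend_viewing_times(viewing), bounds))
```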

I don’t know about you, but when I do this task it looks really hard to me. I feel I am guessing most of the time. But despite that intuition, people can do about 80% correct on this task, often about 90% correct. However, event segmentation theory says this task ought to be harder if five seconds from now is going to cross an event boundary. The design of the experiment is pretty simple. We pause the clip, we test them and then we restart the clip so they can see if they got it wrong. There are two types of trials that we are interested in. On within-event trials, five seconds from now is still part of the same ongoing event. In across-event trials, five seconds from now crosses an event boundary. The event boundaries came from a previous experiment, in which we had many many people segment these movies. So we knew that these were points at which lots of people perceived event boundaries. The basic hypothesis is that these trials should be more difficult than these trials. Here are data from three experiments in which people performed this task. What we see is that in all cases, people are more accurate within an event than across an event. They are also faster when they are predicting five seconds from now within the same event as compared to across an event boundary. If we change the task slightly so that on each trial rather than presenting two pictures, we present just one picture and ask you “Is this picture showing what’s going to happen next,” then we can, half of the time, present what really is going to happen in five seconds and, half of the time, present the alternative picture. Then we can use the mathematics of signal detection theory to estimate both how well people can discriminate what’s really going to happen next from what’s an incorrect prediction, and also how biased they are to believe that a picture does depict what’s going to happen next. When we do this, the measure of discrimination is called d-prime. Converg­ ing with the other paradigm, that discrimination parameter is higher when we are predicting within an event than when we are predicting across an event. Signal detection theory also gives us that estimate of bias. What we see is that the estimate of bias, or criterion, shifts at an event boundary. The interpretation of this higher criterion within an event compared to across event is that even when I show you what really is going to happen in five seconds. It just looks wrong and you are more likely to say “No, that’s not what’s going to happen”. You shifted your criterion such that something has to really look right before you accept it. So people are overall less confident than when they see what is going to happen next, it is what’s going to happen next. In the final experiment, we just stopped the movie and ask them, point blank, do you think you are going to be able to tell me what’s going to happen next? They rated it on 1–5 scale. And what you see is that they know that they

are in trouble. They are less confident that they are going to be able to accurately predict what happens next. So at event boundaries people are less able to select the picture that shows what's going to happen in the near future. They are less able to judge that a given picture is or is not what's going to happen in the near future. Whatever we show them, they think it looks wrong, and they know that they are about to make a prediction error. This is pretty strong behavioral evidence that at event boundaries, activity is less predictable. I want to say one thing about how this might relate to language. I am going to turn to these things in more detail in Lectures Four and Six. But I think the kinds of prediction errors that we are talking about here in the domain of event comprehension are very much akin to the kinds of prediction errors that we encounter when we look at the unfolding of language over time. So in a garden path sentence, the system is making a prediction about various properties of the word that's going to come next—its semantics, its grammatical class—and then, to the extent that the mechanisms of garden path theory are in play, it reaches a point at which it recognizes the prediction error and has to reinterpret. The technique developed by Hard and colleagues, I think, is very nicely analogous to the way that we can use reading rate as an online index of processing difficulty. And I will show you data this afternoon in which we relate predictability and event boundaries to reading rate. Studies of language have given us some of the most precise neurophysiological techniques for assaying prediction error during online comprehension—in particular, the N400 and P600 components of the electroencephalographic response to words. And I will say more about those in Lecture Six. More broadly, I want to suggest that when people comprehend narrative text, the semantic representations that shape their predictions about upcoming lexical items and syntactic structures are largely event models. There are some things that are language-specific, but there is a lot that depends on an ongoing constructed representation of the event. In the next lecture, I am going to make a case for this proposal by describing effects of segmentation on narrative comprehension that parallel those described today for visual events. That is what we will turn to in the afternoon. Thank you very much for your attention.
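To make the signal detection computation just described concrete, here is a minimal sketch of how d-prime and the criterion can be calculated for within-event and across-boundary trials. The trial counts are invented for illustration, and the log-linear correction is one common convention, not necessarily the one used in these experiments.

```python
from scipy.stats import norm

def dprime_and_criterion(hits, misses, false_alarms, correct_rejections):
    """Signal detection measures from trial counts, with a log-linear correction
    so that hit and false-alarm rates of 0 or 1 do not produce infinite z-scores."""
    hr = (hits + 0.5) / (hits + misses + 1.0)
    far = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    d_prime = norm.ppf(hr) - norm.ppf(far)          # discrimination
    criterion = -0.5 * (norm.ppf(hr) + norm.ppf(far))  # response bias
    return d_prime, criterion

# Hypothetical counts for one participant (not the published data).
within = dprime_and_criterion(hits=40, misses=10, false_alarms=12, correct_rejections=38)
across = dprime_and_criterion(hits=32, misses=18, false_alarms=10, correct_rejections=40)
print("within event: d'=%.2f, c=%.2f" % within)
print("across boundary: d'=%.2f, c=%.2f" % across)
```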

Lecture 4

The Segmentation of Narrative Events

All original audio-recordings and other supplementary material, such as any hand-outs and powerpoint presentations for the lecture series, have been made available online and are referenced via unique DOI numbers on the website www.figshare.com. They may be accessed via this QR code and the following dynamic link: https://doi.org/10.6084/m9.figshare.8982440.

© Jeffrey M. Zacks, Reproduced with kind permission from the author by koninklijke brill nv, leiden, 2020 | doi:10.1163/9789004395169_005

Thank you. I'm hoping that in this lecture I'll start to make a down payment on making clear why I think this is relevant for the understanding of language. I want to start by mentioning just some of the considerations about event language that have struck me as particularly important. And I say this with great humility, because I know that the people in this room for the most part know more about these matters than I do, but I just want to comment on a few of the purely linguistic things about event language that have struck me as really important. One is the technique that our language has in the tense/aspect system for carving up descriptions of how things unfold in time. Maybe the strongest example of this is Vendler's (1957) characterization of the four classes of states, activities, accomplishments and achievements. We can distinguish between a state such as being tired, an activity such as running, an accomplishment such as writing a report, and an achievement such as the moment of getting a joke. The tense/aspect systems in a given language systematize these different classes differently, and so different languages have different tools to bring to bear to highlight states or activities or accomplishments or achievements. One thing that strikes me is that it might be best to treat achievements—these punctate moments of infinitesimal duration—as a separate class from the other three because of their instantaneous nature. But for those other three, we can ask whether the tools that a language gives us to carve up the unfolding of things over time affect people's nonlinguistic cognition about events. One characterization that this group is no doubt highly familiar with is the force dynamic account of motion events proposed by Talmy and his colleagues, which notes that an important linguistic difference is the degree to which a language favors encoding information about the path of motion or the manner of motion in verbs. When I first was baptized into this literature, this
distinction between verb-framed and satellite-framed languages was taken as gospel, but now, in revisiting it, I understand that there's been a good degree of controversy about exactly how these play out across languages, and the distinction is not as clear-cut as it first seemed. Nonetheless, there's clearly a difference between constructions in which the path of motion is included in the verb and constructions in which the manner of motion is included in the verb, and it's certainly the case that some languages favor one construction more than the other. A couple of examples have struck me about how cross-linguistic differences can affect the roles that entities can play, with possible consequences for non-linguistic cognition. In a recent study from Phillip Wolff and his colleagues (2009), they compared Chinese and English to Korean. In Chinese and in English it's equally permissible for animate and inanimate entities to play the grammatical role of causers, whereas in Korean, I am told, it's odd for an inanimate entity to play the role of the causer. When they asked speakers of these languages to look at simple animations and to judge whether they consisted of one or two events, they found that this pattern was related to systematic differences in the communities as to whether they individuated the animation as one or two events: Things that would be described with a single verb construction in English and Chinese were more likely to be described as one event rather than two by these speakers. Another great example of how languages vary in how they can represent event structure comes from serial verb constructions. One empirical example, which comes from Rebecca Defina and her colleagues (2016), concerns serial verb constructions in Avatime. In Avatime, you can say things that translate literally as "Then return ascend", which glosses as They then climbed up again. So that's a construction that involves two verbs back to back that form a compound structure that putatively refers to a single event. Here is another example: "Wash face give child" glosses as In order for them to wash the child's face. In the progressive form, "Mold type-of-porridge throw put" glosses as "They were molding the porridge, throwing it to the old man". So you can take these serial verbs—usually just two but sometimes up to four in sequence—and describe a single event with this string of verbs. There's been some debate in the analysis of serial verbs about the extent to which they are really conceived by the speaker as denoting a unitary event, as opposed to being represented in a compound structure but then just being output as a serial list of verbs. Defina got the brilliant idea to look at co-speech gestures as a way of getting a window on the non-linguistic component of the speaker's conception. This is a still frame from a video of a woman telling a piece of Avatime folk history. The meaning of her utterance is "Then they

shared the porridge out". The literal words are translated as "then", "separate", "put on a flat surface". "Separate" and "put on a flat surface" form a single serial verb construction, and the axes here mark the onset and offset of the gesture that she's making to indicate the act of sharing it out. That gesture spans the two verbs, the verb translated as separate and the verb translated as put on a flat surface. Defina's (2016) idea is that if you see gestures that span a serial verb construction, that's evidence that it is being treated as a single unit, whereas if it's really being treated as a complex that's merely output as a sequence, then you ought to at least sometimes see separate gestures for the separate components. Let's look at a few other examples. This is a longer utterance with two gestures in it. It means "So they came back and some of them reached some place". The first set of verbs is "return", "come". That's a serial verb construction, and the co-speech gesture spans both of the verbs. The second one is "reach", "reach some place". These two verbs are separate, not a serial verb construction. And here you have a co-speech gesture that only spans one of the two. So we can divide up the utterances into serial verb constructions, other complex verb phrases, simple verb phrases, and nonfinites. We can characterize the gestures as cases where you have a single gesture that overlaps all of the verbs, a single gesture that overlaps part of the verb sequence, multiple gestures over the course of the verbs, or gestures that are not on the verbs, or utterances that don't have any gestures at all. And the key thing is that you never, ever, in the corpus see a serial verb construction with multiple gestures within the string of verbs. I think that's really strong evidence that at early cognitive stages of forming the conceptual message, this is conceived as a single event rather than a series of events that are merely concatenated in the output. These examples are meant to illustrate that language has a bunch of tools to encode events and that different languages use these tools differently. The Defina (2016) study especially highlights for me the value of combining linguistic and non-linguistic dependent measures to analyze the role of event structure in utterance formation. In this case, it suggests that event descriptions correspond to complex event representations with componential structure that are formed as a unitary whole prior to being handed over to the language production system. In my research, you know, "shared out the porridge" is a kind of complex idea, but we're mostly interested in ideas that are much, much more complex, like people telling extended stories for tens of minutes. And here, once you get into multi-utterance descriptions of events, the segmentation mechanisms that we discussed in Lecture Three really come to the fore. I think we can apply a lot of the same tools that we used to study non-linguistic event cognition to studying event structure in narrative language.
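As a sketch of how this kind of gesture-verb coding could be operationalized, here is a small example that classifies an utterance's gesture pattern from time-aligned annotations. The annotation format, the category labels, and the example intervals are hypothetical illustrations of the interval-overlap logic, not Defina's actual coding scheme.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[float, float]   # (onset, offset) in seconds

@dataclass
class Utterance:
    construction: str            # e.g. "serial_verb", "complex_vp", "simple_vp"
    verbs: List[Interval]        # time span of each verb
    gestures: List[Interval]     # time span of each gesture stroke

def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] < b[1] and b[0] < a[1]

def gesture_pattern(utt: Utterance) -> str:
    """Classify how the gestures in an utterance relate to its verbs."""
    if not utt.gestures:
        return "no_gesture"
    on_verbs = [g for g in utt.gestures if any(overlaps(g, v) for v in utt.verbs)]
    if not on_verbs:
        return "gesture_off_verbs"
    if len(on_verbs) > 1:
        return "multiple_gestures_on_verbs"
    g = on_verbs[0]
    covered = sum(overlaps(g, v) for v in utt.verbs)
    return "single_gesture_all_verbs" if covered == len(utt.verbs) else "single_gesture_partial"

# Hypothetical annotated utterance: a two-verb serial construction with one spanning gesture.
utt = Utterance("serial_verb", verbs=[(0.0, 0.4), (0.4, 0.9)], gestures=[(0.1, 0.8)])
print(gesture_pattern(utt))      # -> single_gesture_all_verbs
```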

I want to start by trying to make a point about the relationship between the propositional textbase and the event model. Remember yesterday we talked about Walter Kintsch and Teun van Dijk’s (1983) proposal that when we read a text we construct a multilevel representation that has the surface structure (the exact trace of the wording and the grammatical structure, which is lost rapidly), the propositional textbase (the propositions that are asserted by the text, which is a little bit more durable), and the event models, which are the primary representation underlying comprehension and subsequent memory. The role of the situation model in comprehension was brought to the fore in psychology primarily by a series of experiments by John Bransford and his colleagues. Other players were active in this around the same time. But in the early seventies, Bransford and colleagues (1972) did experiments like this one. They had people read sentences that included items like “Three turtles rested on a floating log, and a fish swam beneath them”. And then a test sentence might say “Three turtles rested on a floating log, and a fish swam beneath it”. If you saw this item on a subsequent memory test with some delay, do you think it would be easier or harder to reject? Easy to reject? I would disagree. I think this one would be harder to reject. At least it would be harder to reject than if you read this: “Three turtles rested beside a floating log, and a fish swam beneath them”, and then you were given as a test “Three turtles rested beside a floating log, and a fish swam beneath it”. In fact, these two differ from each other, of course, only in one word. And in both cases, the test item differs from the studied item only in one word. Why am I arguing that this one would be easier to reject? Because this describes a different situation model than this. So even though in both cases we’ve changed exactly one word and exactly one proposition between the studied item and the foil, in this case, we’ve changed the situation model. Just to make this really concrete, I’ve made an animation. In the first case if we read this, the hypothesis is that we construct an event model with a spatiotemporal framework that includes in it a log. And on that log, there sit three turtles and beneath the turtles swims a fish, right? And so then if we read “Three turtles rested on a floating log, and a fish swam beneath it”, we’re golden, right? This matches that. And we’d say “Yeah, yeah, I read that”, even though we didn’t actually read that. In the second case, “Three turtles rested beside a floating log, and a fish swam beneath them”, again we have a spatiotemporal framework with the log in it. We’ve got three turtles next to the log. And we’ve got a fish swimming beneath them. And then we read this sentence [[“Three turtles rested beside a floating log, and a fish swam beneath it”]]. And this sentence, if this is the content of our event model, this sentence doesn’t match it, so it should be much easier to reject. We can recognize that the fish is in the wrong place and we’ve got to move it. And this really illustrates
what I mean going back to yesterday’s lectures about event models. You’ve got depictive bits of representation, isomorphic components of the representation. But the relationship amongst those components can be manipulated in order to do cognition and in this case in order to compare this foil to this target. Okay, let me give one other example. Let me just read out this paragraph. “Over to the Indian camp.” The Indian who was rowing them was working very hard, but the other boat moved further ahead in the mist all the time. “Oh,” said Nick. Uncle George sat in the stern of the camp rowboat. It was cold on the water. “There is an Indian lady very sick.” Nick lay back with his father’s arm around him. The Indians rowed with quick choppy strokes. Nick and his father got in the stern of the boat and the Indians shoved it off and one of them got in to row. Nick heard the oarlocks of the other boat quite a way ahead of them in the mist. The young Indian shoved the camp boat off and got in to row Uncle George. The two boats started off in the dark. The two Indians stood waiting. “Where are we going, Dad?” Nick asked. At the lake shore there was another rowboat drawn up. Okay, something is clearly not right with this paragraph. Despite the fact that I would assert that every word in it was written by a master of English prose. Does anyone recognize the author of these sentences? It’s Hemingway and this comes from one of The Nick Adams stories called The Indian Camp. But there’s something clearly not right and it’s just simply that I rearranged the order of the sentences. So now if we read them in the right order: At the lake shore there was another rowboat drawn up. The two Indians stood waiting. Nick and his father got in the stern of the boat and the Indians shoved it off and one of them got in to row. Uncle George sat in the stern of the camp rowboat. The point is that even though at the sentence level the previous is a perfectly coherent narrative, there’s super-sentential structure at the level of the event representation that is completely obliterated by scrambling the sentences. And that’s a fundamental component of what we process when we process narrative. Okay, so to remind us from Lecture Two, we’ve got a surface structure, a propositional textbase, and then an event model. I’m going to argue that we can have prose that is appropriate at the level of the textbase without being properly hooked up at the level of the event model.

One way to describe this is as a distinction between cohesion and coherence. Cohesion is connections between text elements at the level of the surface form—words and grammatical structures. If we have overlapping arguments, if we have repeated words, if we have semantically closely related words, those all facilitate cohesion. Whereas coherence is connections at the level of the propositional textbase or the event model. The scrambled Hemingway actually has pretty good cohesion, but it has lousy coherence. In lots of situations, listeners and readers aim for a particular standard of coherence. They are monitoring to make sure that they are building structures in their minds that have a sufficient degree of integration and connectedness, that is to say, a meaningful succession of event models. This may not always be the case. In some cases, if the reader doesn’t have a good reason to be building a rigorous representation or if it’s just too difficult—a text that is beyond their ability—or they don’t have the appropriate goal, they may give up and have a quite low standard of coherence. But in many reading situations and particularly ones where we would say that the person really understands the text, they’re striving for a certain level of coherence and will regulate their processing to achieve that level of coherence. So, the big question I want to entertain is: How do language comprehenders construct a series of coherent event models from a text? And to try to answer that, we’re going to turn to Event Segmentation Theory. So, just to remind us from this morning … The theory proposes that as you’re experiencing an event either live or in a narrative, you’re constantly building predictions about what’s going to happen in the near future guided by an event model—a representation of what’s happening now. The key thing is that you update that model when you experience a spike in prediction error and at those points you gate information into the event model from the early stages of your perceptual and sensory processing and also from your long-term memory, from your schemas and from your episodic memory. So, the theory says that we update these models in a gated fashion when prediction error spikes. Remember, we saw this morning evidence that segmentation is happening all the time in viewers’ perception. We tend to experience event boundaries when more features of an activity are changing, and the theory says that this is because feature changes tend to produce prediction errors. Just like we can test the hypothesis in visual events that there are going to be more event boundaries when more visual features are changing, we can test the same hypothesis for a narrative comprehension. What I want to turn to now is a set of studies testing that proposal.
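To make that gated-updating mechanism concrete before we look at the data, here is a minimal sketch in Python. It is not the implemented Event Segmentation Theory model; the feature stream, the distance measure, and the threshold value are all illustrative assumptions. It simply shows the core loop: predict from a stable event model, monitor prediction error, and gate new information into the model only when the error spikes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input: a stream of feature vectors (characters, goals, space, ...)
# that stays roughly stable within an event and jumps at three hidden
# boundaries (after steps 25, 50, and 75).
segments = [rng.normal(loc=m, scale=0.1, size=(25, 6)) for m in (0.0, 1.0, -1.0, 2.0)]
stream = np.vstack(segments)

threshold = 1.0          # assumed prediction-error threshold (free parameter)
event_model = stream[0]  # current event model: here, simply the last gated input
boundaries = []

for t, observed in enumerate(stream[1:], start=1):
    # With a stable event model, the prediction is that the next input will
    # resemble the model's current content; the error is the mismatch.
    prediction_error = np.linalg.norm(observed - event_model)
    if prediction_error > threshold:
        # Error spike: open the gate and update the model from the current
        # input (in the full theory, also from schemas and episodic memory).
        boundaries.append(t)
        event_model = observed

print("Event boundaries detected at time steps:", boundaries)
```

In this toy version the detected boundaries fall exactly where the feature values jump, which is the pattern the following studies test with real narrative materials.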

Remember, I described in the morning a coding of narrative film where we can identify changes in causes, characters, interactions amongst characters, interactions with objects, goals and space. There’s nothing magic about this particular list of features, but these are certainly features that viewers of a movie or readers of a text are likely to be monitoring and making predictions on much of the time. And remember when we coded those, we saw that the probability that you identify an event boundary increases systematically as the number of features changing within an interval increases. In narrative, we can do the same kind of coding. The narrative I’m going to be working with most here is the one that I introduced yesterday, One Boy’s Day. This is a narrative of a day in the life of this five-year-old boy Raymond from the moment he woke up in the morning to when he went to bed in the evening. We can take the sentences in the Raymond stories and code them for changes in a bunch of features, so the features that we used in this case are similar to those that we used for the movie: changes in causal interactions, characters, goals, objects, space and time. If you have a sentence like Mrs. Birch went through the front door into the kitchen, this is the initiation of a new causal chain; there’s nothing antecedent in the text that would cause that event. And it’s a change in spatial location. If you have a clause like pulled a light cord hanging from the center of the room, that’s a change in the object being interacted with. If the focus shifts from Mrs. Birch to Raymond, that’s a change in character, and also it’s a mention of time. Time, it is worth noting, works unusually in these stories for a narrative text, because unlike any decent novel the Raymond stories don’t jump around in time. Remember this is a veridical record of an actual boy’s day told in the order the events happened, so there’s no jumping forward or backward in time. But there are these occasional mentions [[of time]], temporal adverbials and other constructions, “immediately”, “for a minute or so”, “after”, so we coded those in the stories as well. The rest of these are all changes. In the experiment I want to describe, the narratives were presented to different subjects in one of three ways: either auditorily (we had a professional narrator read forty-five minutes’ worth of episodes from the Raymond texts), or we presented them visually one clause at a time on a computer screen and they pressed a key to advance to the next clause, or we presented them visually printed out on a page and they were able to read a whole page at a time. We asked people to segment in each of these modalities. For the auditory modality, it works just like segmentation in the movie-watching task: they just push a button whenever they experience an event boundary. For
the clause-by-clause presentation, they would push a key when they felt that clause ended an event, and began another one. And for the visual continuous presentation, we had them mark with a pencil on the page where they experienced the event boundaries. And here’s what you get for all three presentation modalities. You find, first of all, that people are able to follow the instructions to modulate their grain of segmentation. So, if we ask them to identify fine grained boundaries, they identify more than if we ask them to identify coarse grained boundaries. So they’re making smaller units, that is to say, more boundaries for fine grain. And for all three, the more features are changing in a clause, the more likely they are to experience an event boundary. This is another dataset in which participants read and segmented the Raymond stories. Here what we’re plotting is the effect of each individual change on the likelihood that someone is going to segment for either coarse or fine grain. These are odds ratios. An odds ratio of one means that the change has no effect, an odds ratio greater than one means it increases the odds of segmentation, and an odds ratio less than one means it decreases the likelihood of segmentation. You can see that changes in characters for both coarse and fine segmentation have a big effect on the likelihood of segmentation, greatly increasing the likelihood that somebody’s going to segment. In this dataset, changes in objects showed bigger increases for fine than for coarse. Changes in space and causes showed bigger increases for coarse than fine. Time and goals were pretty close for both coarse and fine. But in general, for each of these, when the change happens you’re more likely to segment. I want to return to the boy/dog/frog stories that we talked about a little bit this morning. Recall that Joseph Magliano and his colleagues (2012) had people segment these stories based on just the pictures alone. Well, it turns out they also developed a narrative version of the stories that could be presented as just text alone. This is not Mercer Mayer’s original text, but it’s a text designed to stand on its own. So Once there was a boy who had a pet dog and frog. One night he went to sleep with his pets. But, when he woke up in the morning, the frog was gone. The boy was very sad. Just as we coded the pictures for feature changes, we can code the text for feature changes. And the codings that they used again were changes in space or time, characters, affective reactions, or goals and they broke the goals into subordinate and superordinate goals. I’m not too worried about that distinction here.
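Those odds ratios, and the feature codings just described, can be estimated with a logistic regression in which each clause is scored for which features changed and for whether the reader marked a boundary there. Here is a sketch with simulated data; the feature names and effect sizes are made up, and a real analysis would also account for differences across readers and stories, which this toy version ignores.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_clauses = 400

# Hypothetical clause-level codes: 1 if that feature changes in the clause.
features = ["cause", "character", "goal", "object", "space", "time"]
changes = np.column_stack([rng.binomial(1, 0.2, n_clauses) for _ in features])

# Simulate boundary judgments; character changes are given the largest
# effect purely for illustration.
true_betas = np.array([0.6, 1.5, 0.7, 0.8, 0.4, 0.5])
logit_p = -2.0 + changes @ true_betas
marked_boundary = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit_p)))

# Fit the logistic regression and convert coefficients to odds ratios:
# an odds ratio of 1 means no effect, above 1 means the change makes a
# boundary judgment more likely.
fit = sm.Logit(marked_boundary, sm.add_constant(changes)).fit(disp=False)
for name, oratio in zip(features, np.exp(fit.params[1:])):
    print(f"{name:9s} change: odds ratio ~ {oratio:.2f}")
```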

What we saw this morning is that in the picture stories, the more of those features were changing the more likely people were to segment. And just the same thing is true for the text stories. And in both cases, this holds equally strongly for older adults and younger adults. I want to turn to a set of narratives that Heather Bailey (2017) constructed when she was a postdoctoral fellow in my lab, that allowed us to manipulate the presence of two kinds of change: changes in spatial location and changes in characters. Here is an example of a narrative that has six different shifts over the course of the narrative. In the excerpt that I’m showing you here, there is one spatial shift and one character shift. The way these stories worked, they first mentioned a critical word that will be used for memory tests which I’ll tell you about in a few lectures. In this case, the critical phrase was from the basket by the front door. So, you read Jim picked up his keys from the basket by the front door. The basket was supposed to be a place for just keys, but his were always buried under everything else in there. Jim hated how it became a place to keep junk. From now on he would keep it clean, he vowed. He found his keys and walked into the garage. That’s a spatial shift, and if people are monitoring spatial location and updating their event models when they encounter a change in spatial location, then that ought to be an event boundary. And then the story goes on. In the memory experiments, we probed their memory right here. Later in the story, you read On the top shelf on the corner, Jim saw the box that his wife had conveniently labeled “Camping Gear”. As he pulled it down, the sleeping bags that have been piled on top fell down around him. “At least I won’t forget those,” he muttered as the last one bounced off his shoulder. Opening the tote, he found matches, fire starter, flashlights, camping dishes, and some random pieces of rope. Then, we have a character shift. Walking into the garage, Kathy laughed at the pile of stuff surrounding her husband. So, participants read these extended narratives. In the first experiment, they would just read this narrative continuously. And they would mark off event boundaries. There were three groups of participants, one was told just read for comprehension, the next group was told “after you read the story we’re going to ask you to write a profile of one of the characters, so pay attention to the

information about the characters”, and the third group was told “after you read the story we’re going to ask you to draw a map of the environment described in the story, so pay attention to the spatial locations”. And we can then plot the proportion of observers who identified an event boundary at each line in the story. What I highlighted here in green, red and blue are the responses of the three groups. We’ve marked the places where there are spatial shifts or right after a spatial shift, character shifts or right after a character shift, and what you can see is that when there’s a shift either on that line or on the next line, many, many people perceive an event boundary. So spikes in event boundaries can occur at other points in the stories, but they’re quite likely to occur at the shifts that we inserted. I should also mention that there could well be other shifts at these points that we didn’t code as shifts, because there are lots of other dimensions in the story that we’re not trying to control. But the shifts that we manipulate clearly produce event boundaries. Further, they are modulated, not dramatically but slightly, by people’s instructions. Here’s an example of a spatial shift, and you see that people who are in the spatial condition are more likely to perceive an event boundary than those in the character condition. Here’s a character shift, and you see that the people in the character condition are a little bit more likely to perceive an event boundary than those in the spatial condition. It’s not a dramatic effect, but if you average over all of the stories, what you find is that compared to the no-shift condition, either a character shift or a spatial shift produces a large increase in the likelihood that one’s going to segment. But further, character shifts are more likely to be identified as event boundaries if you’re in the character condition compared to the spatial condition. Here’s a better way of saying it: Spatial shifts produce a larger increase in your segmentation if you’re in the spatial condition compared to the character condition, whereas character shifts produce equivalent increases in your segmentation whether you’re attending to the characters or not. So there’s modulation of whether you segment on a spatial shift depending on whether you’re attending to space or not. I also want to note that people’s treatment of character and spatial shifts in the control condition winds up looking, in this study and in the memory studies that we’re going to discuss later, very much like the character condition, and so it appears that monitoring characters more than space is a kind of default mode for processing narrative text. So, event boundaries are experienced at shifts in narrative situational dimensions. This pattern is consistent across the lifespan. And to some degree which shifts matter varies with temporal grain. And to some degree which shifts matter varies with readers’ goals.
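The line-by-line plots I just described come from a simple aggregation: for every line of the story, compute the proportion of readers who marked a boundary there, and compare lines containing a shift to lines without one. Here is a small sketch of that aggregation with simulated data; the numbers of readers, lines, and shifts, and the response probabilities, are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_readers, n_lines = 30, 120

# Hypothetical codes for one story: which lines contain a shift.
spatial_shift = np.zeros(n_lines, dtype=bool)
character_shift = np.zeros(n_lines, dtype=bool)
spatial_shift[rng.choice(n_lines, size=6, replace=False)] = True
character_shift[rng.choice(n_lines, size=6, replace=False)] = True

# Simulated boundary marks: readers mark a line with higher probability
# when it contains a shift (probabilities are illustrative only).
p_mark = 0.05 + 0.40 * spatial_shift + 0.45 * character_shift
marks = rng.binomial(1, np.clip(p_mark, 0.0, 1.0), size=(n_readers, n_lines))

# Proportion of readers marking each line, then averaged by line type.
prop = marks.mean(axis=0)
print("no shift:       ", round(float(prop[~spatial_shift & ~character_shift].mean()), 2))
print("spatial shift:  ", round(float(prop[spatial_shift].mean()), 2))
print("character shift:", round(float(prop[character_shift].mean()), 2))
```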

Let’s see where we’re at on time. I think this would be a good place to stop for a quick pause. If readers are spontaneously segmenting narrative text as they’re reading it, then we ought to see footprints of that segmentation on other aspects of their reading behavior. The one I want to focus on here is reading rate. Hopefully I laid the groundwork for this a little bit in the previous lecture by introducing the technique of Bridgette Hard and her colleagues (2011), where they presented slide shows made from a movie and looked at the rate at which people advanced through the slide show. In narrative text, where comprehension happens through reading rather than through viewing, it’s really easy to get rich measures of people’s rate. We can present things a word or a clause at a time on the screen and allow them to pace through, or we can let them freely view a text and measure their eye movements to see how quickly they’re reading it. Here is a study from Rolf Zwaan (1996), in which he asked people to read a story one sentence at a time, and simply push a key to advance to the next sentence. This one is called THE GRAND OPENING and it says, Today was the grand opening of Maurice’s new art gallery. He had invited everybody in town, who was important in the arts. Everyone who had been invited, had said that they would come. It seemed like the opening would be a big success. At seven o’clock, the first guests arrived. Maurice was in an excellent mood. He was shaking hands and beaming. And then they read either A moment later, an hour later or a day later, he turned very pale. He had completely forgotten to invite the local art critic. And sure enough, the opening was very negatively reviewed in the weekend edition of the local newspaper. Maurice decided to take some Advil and stay in bed the whole day. If you measure reading time for sentences that contain one of these three phrases, what you see is that the time to read a sentence with a moment later is considerably faster than the time to read sentences that say either an hour later or a day later. And my interpretation is that in this case [[moment later]], there’s no change in the time feature. It’s still the same event model. Whereas in these cases [[hour or day later]], you’re experiencing a prediction error; you know less about what’s coming next because time has changed, and so you’re opening up the gates into your event model, and that increases processing, which takes time and slows people down.

Here’s another example from Gabriel Radvansky’s lab. This is a paper led by Kyle Pettijohn (Pettijohn et al., 2016). THE FARMERS REBELLION. This is a story about a plot to blow up a city council building. It’s an analogy to the Guy Fawkes plot to blow up the British Parliament. And the story goes like this. One difficulty was that the explosion might kill friendly pro-farmer members of the city council. Tess was particularly anxious to warn his brother-in-law, Jim Thorn. On July 26, Thorn showed a letter to the mayor’s lawyer, Steve Flett, who in turn showed it to the mayor. They decided to search the court house and the adjoining buildings. The search was conducted on August 4th, first by Deputy Williams, who actually encountered Fields in the cellar, and saw the piles of blankets, and that night by Todd Billings, a Pitman county sheriff, who discovered the boxes of explosives and arrested Fields, who, under interrogation, confessed and revealed the names of the conspirators. The other conspirators fled from Pitman but were rounded up in Fair Lake. Collins and Crawford were killed, and all of the others that were involved were tried and jailed in November. They coded this narrative for shifts in spatial location, time (either absolute or relative), entities, causes, and goals. For example, just to pick out a couple of them, They decided to search the court house is a change in space. The search was conducted on August 4th is a change in time. They went through the narrative and coded all of these changes, and what they found is that changes in absolute time, in entities, causes and goals all were associated with slower reading of this continuous narrative. So, in a much more complicated narrative situation than Zwaan’s example, people slow down when more things are changing, and remember they’re also likely to identify event boundaries at the points where things are changing. Interestingly, in this study, people actually sped up for spatial shifts, and this has been found in other studies. I’m going to hold onto that thought and come back to it in a second. Here is another study from Zwaan and colleagues (1998). This is an experiment based on a set of materials and a paradigm developed by Gordon Bower, Daniel Morrow and their colleagues (1987). In these studies, what you do is first [[to]] memorize the layout of a lab building—this building has a reception room, a conference room, a laboratory and a library. And you repeatedly study this map and are tested until you can identify the locations of all the objects. So, if I ask you where the table is, you can tell me: the conference room. Then you read a narrative like this.

Wilbur regretted the day he had ever become the head of the research center. He had just found out that the board of directors was coming for a surprise visit the next day. He called all of the employees together in the library and told them the center was a complete mess. He told them to start cleaning up the building immediately. He said he wanted the directors to see a spotless, organized center. He told everybody to spread out and clean every room. He made sure the library was being cleaned and then left to supervise the rest. First he walked from the library into the reception room. He told the secretary to cancel all appointments for the day. He looked for reports on the current projects. Then he walked from the reception room into the conference room. I don’t know about you, but this is what my life is like: a scientist is always cleaning the reception room. They coded this narrative for changes in temporal location, space, causation, goal and protagonist. And they asked what effect those changes had on people’s reading rate. And here the way they analyzed this was using a multiple regression analysis, so that they could ask which of the features had statistically reliable effects on reading rate while controlling for the other features. What you find is, for the group that studied the map, time, space, causation, goal and protagonist all significantly slowed people down. Whereas in the previous study spatial shifts sped people up, here spatial shifts are slowing people down. They can also use this technique to control for other things that might have a big effect on people’s reading rate, including the number of syllables, sentence position, the introduction of new arguments, word frequency, and argument overlap with the previous clause. Critically, a second group who read the stories without studying the map showed significant increases in reading time in response to time, causation, goal and protagonist, but no effect of spatial shifts. This is consistent with a hypothesis that people maybe aren’t routinely tracking space when they’re reading these kinds of narratives. They’re tracking stuff like characters and goals. If you make me memorize a map or think that I’m going to have to draw a map after reading a story, then I’ll track space, but maybe not under normal circumstances. That’s the contrast I want to call attention to. In this study as well, the more shifts that you experienced, the more reading time increased. That’s important. Time, goals, causes and protagonists are consistent predictors of increases, whereas for space, the effect depends on what your reading goals are and what your experience with the map is. And consistent with the effect that I showed you before, where reading with the intent to draw a character sketch or a map affected where people experienced event boundaries, this suggests that how we process these shifts depends on our reading goals. An important question to ask is how shifts on multiple dimensions combine to produce effects on reading rate. In this study by Rinck and Weber (2003), participants read narratives that had critical sentences that could either shift the characters, the location, or the time, or two of the three, or all three. They varied these factors parametrically and independently. Here’s an example. This is an experiment conducted in Germany with the materials presented in German, so this is a translation. Ever since Paul and his wife Frieda had retired, they put a lot of effort in their little house and their garden. On a sunny day in May, they decided to do a big spring cleaning including everything involved. In the fully continuous version, they read Frieda started to tidy up the house, whereas Paul took care of the garden, which he had declared his territory. In the warm midday sun, he started to carefully clean up last year’s withered leaves. Then he dug up the beds, sowed beans and potatoes, and he planted pansies and tomatoes. In the evening, Paul fetched his new lawn mower, which he had gotten for Christmas, from the shed, and he cut the small lawn in the garden. It was twilight already, so he had to hurry in order to finish mowing the lawn in time. If you read this version, by the time you finish it’s twilight, Paul is out in the garden mowing, and the focus is on Paul. So if you read this target sentence after that, In the last daylight, Paul stood in the garden and looked around satisfied, there should be no changes. There’s not a shift in location. There’s not a shift in time. There’s not a shift in character. On the other hand, consider this version, Paul took care of the garden, which he had declared his territory, whereas Frieda started to tidy up the house. The morning air was pleasant and refreshing, so she opened all windows and let spring reach every corner of the house. Then, in a spirit of adventure, she climbed up to the attic. There she searched old boxes and shaky cupboards, and she checked for mice. At noon, she cleaned the winter dust out of the hall and tidied up her beloved cabinet. The noon sun was quite warm already, so she interrupted this work for some time while she closed all windows to keep the house pleasantly cool. If you read this setup, now we’re focused on Frieda, we’re up in the attic, and it’s noon. So if you read this sentence, you’ve got a shift on all three of those dimensions. Here’s what they find. If you compare a temporally continuous continuation to a temporal shift, you find that people read more slowly if there’s a temporal shift. And they do that whether or not there’s a spatial shift, and whether or not there is a character shift. If you compare the brown bars to the blue bars, you find that in all cases, pretty much, there’s a slower response time when you shift characters than when you don’t. It’s maybe marginal here where there are the spatial shifts. But if you compare the spatial continuity condition to the spatial shift condition, you find that there is not much of an increase in reading time as a function of a spatial shift. Okay. So, shifts in characters and time produced reliable slowing but—consistent with what we saw here (just remember our condition where people are not studying a map or intending to produce a spatial description)—we find that shifts in space produced marginal changes in people’s reading rate. Okay. For another example, I’m going to come back to the Raymond stories. Here we can look at the effect of shifts in causes, characters, goals, objects, space and time on reading rate. We see that, consistent with what we’ve seen in the previous studies, shifts in causes, characters, goals and objects all produce increases in reading time, that is, slowing in reading rate compared to a baseline. These change effects are estimated using a regression model as in the Pettijohn et al. (2016) study, and also consistent with what we’ve seen in one of the previous studies, shifts in space actually lead to speeding. In this study, inconsistent with most of the other data, shifts in time didn’t have any effect. I want to point out that this is a stimulus for which we also have segmentation data, and it turns out that the same shifts that produce slowing in reading, with the exception of space, also produced the experience of event boundaries. And if we look at the effect of event boundaries on reading rate, we find that people slow down for sentences that have event boundaries in them even after we control for the effect of shifts. This is consistent with the idea that what’s happening is that the shift is leading to the experience of an event boundary, which leads to updating, which slows you down. Even if there’s not a shift there, if you experience an event boundary you’re going to slow down. And even above and beyond the effect of the shift itself, the effect of the event boundary slows you down.
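The regression logic behind these reading-rate analyses is worth making explicit: reading time per clause is modeled as a function of which situational dimensions shifted, while word-level covariates such as length and frequency are controlled statistically. Here is a sketch of that kind of model on simulated data; it is not the actual analysis code from any of these studies, and the predictor names, effect sizes, and noise level are assumptions chosen only so the example runs.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n_clauses = 500

# Hypothetical clause-level predictors: situational shifts plus covariates
# that also influence reading time and need to be controlled for.
shift_names = ["time", "space", "cause", "character", "goal"]
shifts = np.column_stack([rng.binomial(1, 0.15, n_clauses) for _ in shift_names])
syllables = rng.poisson(12, n_clauses)
log_word_freq = rng.normal(3.0, 0.5, n_clauses)

X = np.column_stack([shifts, syllables, log_word_freq])
names = shift_names + ["syllables", "log_word_freq"]

# Simulated reading times (ms): most shifts slow reading, and the space
# shift is given a small speed-up, echoing the mixed pattern described
# above (all coefficients are made up).
true_betas = np.array([120.0, -40.0, 90.0, 110.0, 80.0, 25.0, -60.0])
reading_time = 1200.0 + X @ true_betas + rng.normal(0.0, 200.0, n_clauses)

# Ordinary least squares with all predictors entered simultaneously, so
# each shift effect is estimated while controlling for the others.
fit = sm.OLS(reading_time, sm.add_constant(X)).fit()
for name, beta in zip(names, fit.params[1:]):
    print(f"{name:13s} effect on reading time ~ {beta:+.0f} ms")
```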

One more example. Here is a set of narratives that Nicole Speer and I (2005) constructed to manipulate the presence of temporal shifts. This is a story about a woman going back packing and it says, She had just bought a new camera, and she hoped the pictures would turn out well. She could hear water running, and figured there must be a creek nearby. It was fairly warm outside during the afternoon, but Mary knew it would be much cooler during the night. She had brought an extra blanket this time. Although she had not liked carrying the extra weight on her hike, she knew she would appreciate it later. Now these are in a little bit smaller font, because we varied in the experiment whether they saw none of these, one of them or the whole thing. And we used that to vary the delay between the presentation of this word creek and a later memory test in some of the experiments. I’ll tell you about the memory data later. For now, what I want to focus on is the time it takes people to read this clause here. So they read either a moment later, she was collecting wood for a fire or an hour later, she was collecting wood for a fire, and we predicted that— consistent with the results of Zwaan (1996)—people would read this more slowly. In one experiment, we measured directly people’s perception that a new event had begun, and in a second experiment we measured how long it took them to read those clauses. Here is the data from those two experiments. What you see is that there’s a small but significant increase in the likelihood that they’re going to identify an event boundary, as a function of that single word change. You also see that there’s a significant increase in the time it takes them to read the clause. So, compared to a moment later, an hour later is more likely to be judged to be an event boundary and is read more slowly. So, in sum, readers slow down at situational shifts. This is associated with the perception of event boundaries. And, importantly, the effects of some shifts, in particular spatial shifts, depend on readers’ goals and attention. In terms of event segmentation theory, we think that the collection of features that you’re monitoring and making predictions on and an experiencing prediction error about is not fixed and immutable, rather it depends on your attention and your goals. At any given time, there’s an indefinite number of features that I could be attending to. I could be attending to the amount of breeze blowing across my face right now and making predictions about how breezy it’s going to be in ten seconds, but most of the time that’s probably not something that my attention is devoted to. And it turns out that for narrative reading, it’s probably the

case that most of the time not much attention is devoted to tracking spatial location. What we care about is characters and what they are doing. Okay, so there’s some evidence here that event boundaries are not simply the sum of the situational shifts in a narrative, because we saw that reading rate declined even more for event boundaries than for equivalent sentences that had shifts but were not event boundaries. I’m going to show you more evidence that there’s more going on at event boundaries than simply processing of shifts. In particular, Christopher Kurby (2012), when he was in the lab, asked whether readers make more contact with the situational features of the text at event boundaries than at points inbetween event boundaries. If a reader is updating her or his event model at an event boundary, then they ought to process more of the situational features at those points, other things being equal. And to test this, he used a really, in my view, creative method. This was a new one for me. He asked people to read narratives one clause at a time and to tell us what they were thinking after reading each clause. It’s a really strange way to read—you’re telling me more words than you’re actually reading. But people can do it and they don’t seem to mind doing it. They read a text, one clause at a time, and after reading each clause, they type what they’re thinking about. And like I said what they type is more than what they read. And then after doing that, we ask them to segment the text once at a fine grain and once at a coarse grain. What they read were four extended excerpts from One Boy’s Day. We can code their think-aloud protocols based on what they’re mentioning. So here’s a case where someone reads Mrs. Birch chuckled with slight embarrassment and what the person types is Mrs. Birch was embarrassed too. So they mentioned the character. Then they read and put them on and they type he wants to wear the shorts. So they’re mentioning a character Raymond, an object the shorts and a goal he wants to wear them. Okay. They go on and they do this and we can ask what’s the relationship between what they’re mentioning here, what’s actually mentioned in the text and whether they’re experiencing an event boundary. Here’s what you find: People mentioned features more when those features change in the text. They talk a lot about characters in their think aloud protocols, but they talk more about characters when there’s a character change than when there’s no character change. That makes sense. When new information is introduced, people mention it in their think aloud protocols. The same is true for objects, and space, and time, and goals but not for causes. However, it’s also the case that people talk more about several of these situational features when they’re reading at an event boundary compared to when they’re reading in an event middle. Compared to event middles, if they’re

reading at a fine or coarse event boundary, they are more likely to mention characters, time for coarse boundaries but not fine boundaries, goals for fine boundaries but not coarse boundaries, and causes for fine boundaries but not coarse boundaries. In general, they are more likely to mention features of the situation at event boundaries compared to event middles. Now, is this just because event boundaries tend to be the place where more stuff is changing? And the answer is no. We can look at this by asking what happens if there’s a change in say, time. Am I any more likely to mention characters if a change in time produces an event boundary? If so, it’s not the change in time itself that’s producing the mention. It’s the event boundary. And so here what we’re plotting is the likelihood of mentioning a feature that did not change. And what you see is that at event boundaries, people are more likely to mention characters, time, goals, and causes, looks about the same as it did in the overall data, even when those features didn’t change. So it’s not just that people are mentioning the stuff that’s changing, they are mentioning more stuff about the situation at event boundaries. Whereas during event middles, they tend to mention more things about their inferences, their reactions, how it makes them feel, what they’re thinking about. They check in more with the story when they get to an event boundary, which is consistent with the idea that what they’re doing at that point is updating their model based on the information in the text. To wrap up, what I want to say is that segmentation during reading has really striking parallels to segmentation during visual event perception. And we can use many of the same tools to analyze people’s reading behavior that we’ve used to analyze their segmentation of visual narratives. What I’m going to do in the next lecture tomorrow morning is look at both visual and narrative event boundaries, to try to give an integrated exposition of what we know at this point about the neural mechanisms by which people segment and update in events, as I said in both visual and narrative experiences. I’ll stop there and we can go to questions.

Lecture 5

Neural Correlates of Event Segmentation

Let me start with a few words about the use of functional neuroimaging for cognitive science and linguistics. One powerful use of neuroimaging is to study the anatomy of the brain, and I have many colleagues who are fundamentally interested in what the parts of the brain are, how they relate to each other, what the neurotransmitter systems are, and what small- and large-scale connectivity of the brain is characteristic of typical human development or disease or injury. I’m fundamentally not interested in that stuff. I find it fascinating. But I’m interested in brains because of what they can tell me about perception and memory, and language processing. Neuroscience has had a powerful effect in terms of giving us new tools in my field, and I see increasing evidence that neuroimaging techniques are having the same effect in psycholinguistics and linguistics. What I want to share today are some approaches to the questions of event perception and representation from visual perception and from language that we’ve been talking about over the last few days, and show how neurophysiological methods, and in particular functional magnetic resonance imaging, can be one additional tool that can aid in our quest to understand how events are represented in the mind. I want to start by characterizing what happens in the brain when one experiences an event boundary in visual perception and in reading. The reason that we’ve been looking at this is because our theoretical model of event segmentation specifies particular computational mechanisms that take place at an event boundary. According to event segmentation theory, at an event boundary, you experience a spike in prediction error and a gating of new information into a set of event models, and then a process of settling back down into a new stable state. And that predicts that in systems that are responsible for prediction error monitoring, the maintenance of event models, and the gating, you should see transient and stable effects on the nervous system around event boundaries. And so we and others have been looking at those effects. As I said, the primary technique that we’ve been using in my laboratory is functional magnetic resonance imaging, or fMRI, and so I’m going to spend just a couple of minutes talking about what fMRI can provide to us as cognitive scientists. Many of you will have had experiences with MRI scanners. They’re becoming increasingly ubiquitous for medical diagnostics. The MRI scanner is basically a large magnet with a set of rapidly switchable gradient magnetic coils that allow one to manipulate and then read off the paramagnetic properties of a volume of tissue, basically to characterize how that tissue responds to changes in magnetic fields. When we do medical imaging, often what we’re interested in is the relative density of tissue. We want to be able to tell bone from muscle, from vessels, from organs, and so we’ll often want to use the magnetic properties to encode the local density of water, which can tell us about what kind of tissue is there. In functional MRI, what we’re interested in is properties that change over time as a function of neural activity, and one that provides a great signal is the local presence of deoxyhemoglobin in the blood. If you use a muscle in your body, your body will route extra oxygenated blood to that muscle. If I were to do curls with my arm, my vasculature would route more oxygenated blood to the arm. And if we were to image that with an appropriate pulse sequence in an MRI scanner, we would find that there is an increase in the signal locally, in highly oxygenated blood. The body is compensating for the extra demands of that work by routing more oxygenated blood there. The same thing turns out to be true on a very local scale within the brain. If one part of your brain is more active, the vasculature will route more oxygenated blood there. Deoxygenated hemoglobin turns out to be an excellent MRI contrast agent. Deoxyhemoglobin disrupts the local magnetic properties, so we can see the local concentration of deoxyhemoglobin. When we route extra oxygenated blood into the area, the local concentration of the deoxyhemoglobin goes down, and we can see that with the scanner. Some of the important properties that result from imaging in this way were mentioned briefly in Lecture Two. First, it’s not invasive. We can put healthy participants in the scanner and image them, and it does them no harm. And we can image them repeatedly. I’ve been scanned hundreds of times. It measures the relative local concentration of deoxyhemoglobin, which has been demonstrated in a large number of physiological studies at this point to track local neural activity. The signal that we look at is usually referred to as the blood oxygen level-dependent (or BOLD) signal. When a local brain area is activated, we have more oxygenated blood there, and deoxyhemoglobin is displaced by oxyhemoglobin. That’s the change that we see. Now this response starts actually quite quickly, but then it evolves over a relatively long period, and so one of the big constraints of using fMRI is that the signal is essentially highly blurred in time. Now, that doesn’t mean that we can’t resolve things that happen quickly, but it means that we’re going to have to be clever in terms of the statistical analyses that we use to pull apart signals that are going to be superimposed on each other. I’ll illustrate that in a moment. We can sample with current pulse sequences faster than once per second, but with a reasonable image size and reasonable contrast, sampling about once per second with voxels (imaging elements) that are three millimeters on a side is about state of the art. This picture I think is really instructive for thinking about what MRI gives you. This is from a study by Geoffrey Boynton and colleagues (1996), in which they stimulated the primary visual cortex by having people watch flickering checkerboard patterns that drive the neurons in primary visual cortex really aggressively. On a given trial, the flickering checkerboard pattern was on for either three seconds, six seconds, twelve seconds or twenty-four seconds. And then they imaged for about forty seconds. This is illustrating the duration of neural stimulation of primary visual cortex. And this is illustrating the fMRI BOLD response coming out. What you can see is that the signal doesn’t even really start rising until after that three second interval is over. And it peaks about five or six seconds after the neural activity. The neural activity is represented by this yellow stripe. The fMRI response is represented by this curve. So the fMRI response to a brief stimulation is going to peak about five or six seconds later. If we concatenate two three-second stimulations together into a six-second stimulation, we don’t see very much difference in the beginning of the signal, but we see a broadening of the response; this is exactly accounted for by superimposing a curve like this with another curve like this starting three seconds later. So these things sum together. If we stimulate for twelve seconds, we could think about that as superimposing four of these curves, each offset by three seconds, and we get this curve. And finally, if we stimulate for twenty-four seconds, we could think about that as superimposing eight of these curves, and now you see a response that approaches its steady state value. So by this point the signal’s not going to get any higher, because the rises that are due to the later stimulation are being balanced by the decays that are due to the end of the effects of the earlier stimulation. So basically, the hemodynamic response in MRI is a big temporal blurring filter that smears the signal out in time. It’s quite precisely time-locked, so we can tear apart things that happen close in time. But we’re going to have to pull them out of these signals that are concatenated together, superimposed. With that in mind, the way that we initially went to look at the neural responses to event boundaries was by showing people movies of the sort that I’ve been showing you over the last few days, in the scanner. Here’s an example of one of the stimuli. This is a movie of a man washing dishes and this movie lasts about six minutes. People watched it and three others in the scanner. And then after watching the movie, we taught them the segmentation task that we’ve been talking about over the last couple of days. So just to remind you, this is a task in which people are watching the movie again and they’re pushing a button whenever in their judgment one meaningful unit of activity ends and another begins. This is really important. On that initial viewing where they’re being scanned, these people have never heard of event segmentation. They’ve never heard of this task. They’re not thinking about event segmentation consciously. What we’re recording from their brains is what’s normally a component of their ongoing perception. Then we teach them about the segmentation task. And then in this study we scanned them again while they were doing fine segmentation, trained them on coarse segmentation, then scanned them again while they were doing the coarse segmentation. Across participants, we counterbalance the order of coarse versus fine, so half of the people get fine first and coarse second, the other half get coarse first and then fine second. This is just to give you an idea of what the raw data wind up looking like when we do these scans. This is one imaging element, called a voxel. It’s a signal over time, after we’ve cleaned it up in pre-processing a little bit, while the person was watching that dish-washing movie. Suppose that this person had then gone on to segment at these locations. What we’d want to know is, are there changes in time in the brain activity at those points where they went on to segment? So intuitively what you want to do is draw a window around each of those time points, pull out the signal from those windows and average them together to get an estimate of the typical response around the event boundary. And if we’re looking at a part of the brain that has no consistent response to event boundaries, that should average away to a flat line. But if we’re looking at an area that has some consistent evoked response, then we’ll see that in the average. So that’s the essence of the statistical technique. Now, there are a number of confounds that one has to worry about. So rather than simply averaging, we fit a linear regression model that models some of those confounds. In particular, it allows us to deal with the fact that these things overlap in time. This window, for example, includes the beginning of one response and the end of the other response.
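To show what that windowing-plus-regression logic looks like in practice, here is a simplified sketch. It simulates a single voxel's signal, with a response after each of a handful of made-up boundary times, and then estimates the evoked timecourse with a finite-impulse-response design fit by least squares, which handles overlap between nearby boundaries in a way that simple window averaging cannot. The boundary times, the hemodynamic shape, and the noise level are all invented; this is not the lab's actual analysis pipeline.

```python
import numpy as np

rng = np.random.default_rng(4)
n_scans = 360                                             # ~6 minutes, sampled once per second
boundary_scans = np.array([40, 95, 150, 210, 265, 320])   # hypothetical boundary times

# Crude hemodynamic-like response peaking about five seconds after an event
# (for simulation only, not a calibrated HRF).
t = np.arange(0, 25)
hrf = (t ** 5) * np.exp(-t) / 120.0 - 0.1 * (t ** 10) * np.exp(-t) / 3628800.0

# Simulate the voxel: superimpose one response per boundary, plus noise.
signal = rng.normal(0.0, 0.05, n_scans)
for b in boundary_scans:
    n = min(len(hrf), n_scans - b)
    signal[b:b + n] += hrf[:n]

# Finite-impulse-response design: one column per lag (0-19 s) after any
# boundary. Fitting all lags jointly by least squares separates responses
# that overlap in time, unlike averaging fixed windows.
n_lags = 20
design = np.zeros((n_scans, n_lags))
for b in boundary_scans:
    for lag in range(n_lags):
        if b + lag < n_scans:
            design[b + lag, lag] = 1.0
design = np.column_stack([design, np.ones(n_scans)])  # constant baseline term

betas, *_ = np.linalg.lstsq(design, signal, rcond=None)
evoked = betas[:n_lags]
print("Estimated response peaks ~", int(np.argmax(evoked)), "s after the boundary")
```

In this toy example the estimated evoked response peaks a few seconds after the boundary, which is the kind of lag you should keep in mind when reading the empirical timecourses described next.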

When we first went to do these studies, we didn’t actually know whether the signal would mostly precede the event boundary or mostly follow it. We drew a twenty-second window around the event boundaries and that’s what we averaged over. And what you find is that in a number of regions throughout the brain, there are evoked responses that don’t average away to being flat—that are consistent phasic responses at the event boundary. So here what I’m showing you is a set of axial slices through the brain going from top to bottom. This is as if you slice through this way, and the colored patches are indicating those locations in the brain at which there were significant evoked responses. In another slide, I’ll show you the same responses superimposed on the cortical surface, where it’s a little bit easier to interpret. But what we consistently see across studies is a response in the right parietal cortex and responses in the posterior cortex on both the medial and lateral surfaces, near the junction of the parietal, temporal, and occipital lobes. I’ll say a little bit more about those areas as we go along. This illustrates what the response looks like in time. While people are doing the segmentation task, you see these large consistent responses—and note that these are peaking between five and ten seconds after the event boundary. If you compare that with the curves in Boynton’s (1996) study, that’s consistent with the maximum of neural activity being right at the point at which they push the button. Throughout the network, we see this typical response where things peak about five to ten seconds after the person pushes the button. And the responses are almost always larger for coarse-grained boundaries than for fine-grained boundaries. (I’ll show you one exception to that.) However, there’s a problem interpreting these data, because when these data were collected, people were performing an active task and were thinking consciously about the segmentation of the activity, because we’d asked them to do that. We know that there are large neural responses from making decisions, planning actions, and pushing buttons. And these responses are quite likely compromised [[or]] contaminated by those other processes, which are not the processes that we want to study. What’s most interesting is the brain’s response during that initial passive viewing session. So here, if we time-lock the button pushing data from these later scans to the brain data from the initial scans, we can look at the brain response uncontaminated by those other processes. And what you find, not surprisingly, is that in most areas the responses are smaller, because we’ve taken out those contaminating signals. But they have a quite similar timecourse and a quite similar difference between coarse and fine, where we see larger responses for coarse-grained boundaries than for fine-grained boundaries.

Here’s what these responses look like superimposed on the cortical surface. What I’m showing you here is a lateral view of the left hemisphere, if I’m standing and you’re looking at my brain from this side, lateral view of the right hemisphere, looking at my brain from this side, and then a medial views of the left and right hemispheres. If you took away my right hemisphere and looked at the left hemisphere from this angle that would be this one. And then this is the medial view of the right hemisphere. Again, these are these strongest responses bilaterally in the back of the brain and often stronger in the right hemisphere and the right side of the brain. Those movies of people doing things like washing dishes or washing a car or making a bed, they’re pretty simple and they’re pretty short. We’ve wanted to know whether these kinds of effects generalized to other stimuli. One of the other stimuli that we’ve looked at is The Red Balloon which we’ve talked about in the previous lectures. This is a forty-minute feature film, and again one of the useful things about it for these purposes is that it has no spoken, almost no spoken dialogue. There are a couple words in the movie and those words are in French and most of our participants are not French speakers. In this study we showed this movie in four pieces to our observers and recorded brain activity while they watched the movie. We can plot the areas of the brain that show significant responses at event boundaries. This is only during passive viewing—these are people who have never heard of events segmentation while they’re being scanned. And you can see this is a study with a longer interval of scanning and more subjects. We have more statistical power and we see responses in more parts of the brain. But you can see that it’s quite similar in the distribution of areas to what we saw in the initial study. These areas in the lateral parts of the junction of the parietal and temporal lobes tend to be the strongest responses. We always see this area of right frontal cortex and we often see strong responses in the medial parts of the back of the brain. And this is just pulling out the timecourse of response of a number of these areas to illustrate that as before you typically see a peak between five and ten seconds after. Now, one thing that is different about this stimulus and any of the others that we’ve looked at is that the responses are larger for coarse-grained boundaries and for fine-grained boundaries. It’s worth noting one possibility is that it has something to do with the more extended narrative structure of these movies. There are other possibilities. I think this is a very interesting topic to explore. It seems to me there’s got to be some point at which the responses will not keep getting larger and larger. That if you go to large enough event boundaries, there’s got to be an optimal scale after which things get smaller again. But nobody has yet systematically swept out what that optimal scale is.


In my lab we've mostly used this technique where we use the participants' behavioral segmentation to characterize the dynamics of the response to events and event boundaries. Other labs have used converging approaches, and one that I'm really excited about comes from the lab of Uri Hasson and Ken Norman at Princeton. This work was done by Christopher Baldassano. What Baldassano and his colleagues (2017) did was show people extended excerpts from two television dramas. Here we're looking at one of them, the BBC drama Sherlock. They had people watch Sherlock episodes, and then what they did was extract timecourses from small spherical collections of voxels. I've just schematized those with a green circle here. You zoom in on a little patch of brain at a time and pull out the timecourse of all of the voxels in that patch. Here they've illustrated it as if there were just three voxels, but really it's going to be a few dozen. Then, you fit a statistical model that looks for boundaries such that within a chunk the timecourses are as stable as possible, and across chunks the timecourse changes as much as possible. So, it's looking for the transitions in the patterns in that little patch of cortex. If you specify a number of boundaries that you're looking for, it will give you the optimal placement in time of those boundaries to best describe how the pattern is changing in that little patch of cortex. And then you can move the little patch over the whole brain and characterize where the optimal boundaries are for each patch of cortex across the brain. You repeat that for patches all over the brain, and you can also repeat it varying k, the number of boundaries that you tell the model to look for.

And so you can get two things out of this model. One is: What's the value of k that best accounts for the data? If I set k too small, then I will have big transitions within my events that are not accounted for by the model. If I set k too big, then I'll have to put boundaries in the middle of what should be a stable state, and so again my model won't fit very well. You can ask, for a given region of the brain, how many events describe it optimally—in other words, what's the temporal scale at which that part of the brain is coding the stimulus? You also get out of it the locations of the boundaries.

Okay, this approach, in contrast to the one that I just described, doesn't require any behavioral segmentation on the part of the viewers. It's just doing segmentation in a data-driven fashion based on the brain. When you do that, here's one result that you get. What they are plotting here is the locations of optimal boundaries in a number of brain regions for an excerpt from one of the movies. Don't worry yet about what these brain regions are. But this is critical: They're comparing this to behavioral segmentation of the sort that we typically do in my lab. The behavioral segmentation was done completely independently of the brain analysis. But you can see there's a really nice correspondence in most areas between where the observers behaviorally say that there are event boundaries and where the brain tells you that there are event boundaries.
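The actual model in Baldassano et al. (2017) is a variant of a hidden Markov model, but the core idea—place k boundaries so that multivoxel patterns are as stable as possible within chunks—can be caricatured with a much cruder greedy search. Everything below (the toy data, the greedy strategy, the variance criterion) is an illustrative assumption of mine, not the published algorithm.

```python
import numpy as np

def within_chunk_variance(patterns, boundaries):
    """Total squared deviation of each timepoint's pattern from its chunk mean."""
    edges = [0, *boundaries, len(patterns)]
    total = 0.0
    for start, stop in zip(edges[:-1], edges[1:]):
        chunk = patterns[start:stop]
        total += ((chunk - chunk.mean(axis=0)) ** 2).sum()
    return total

def greedy_boundaries(patterns, k):
    """Add k boundaries one at a time, each time choosing the split that most
    reduces within-chunk variance. A crude stand-in for the HMM fit."""
    boundaries = []
    for _ in range(k):
        best = None
        for candidate in range(1, len(patterns)):
            if candidate in boundaries:
                continue
            trial = sorted(boundaries + [candidate])
            score = within_chunk_variance(patterns, trial)
            if best is None or score < best[0]:
                best = (score, candidate)
        boundaries = sorted(boundaries + [best[1]])
    return boundaries

# Toy data: three "events" with distinct mean patterns across 20 voxels.
rng = np.random.default_rng(1)
means = rng.normal(size=(3, 20))
patterns = np.vstack([m + 0.3 * rng.normal(size=(40, 20)) for m in means])

print(greedy_boundaries(patterns, k=2))   # should recover boundaries near 40 and 80
```

Sweeping k and asking which value best describes the data for each patch of cortex is what yields the region-by-region temporal scales discussed next.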


That converges with the kind of analysis that I showed you before. And then the other really interesting result that you get out of this is a description of which parts of the brain are tuned to fine-grained events and which parts of the brain are tuned to coarse-grained events. In this color map, the blue regions are short timescales and the yellow regions are longer timescales. If we look here at the medial surface of the back of the brain, this is the primary visual cortex, the first place that visual information projecting from the retinas hits the cortex. And you would expect that in those early visual processing stages there would be little event structure per se; rather, the representations would be evolving rapidly as the visual stimulus changes. And that's what you see: the timescale of those areas is very short. As you move forward in the brain into what are often described as secondary and tertiary association areas, you get into parts of the brain that have much more complex response properties—that respond not just to visual information but also to tactile and auditory information, and to stimuli not just from vision but from language. In those areas of the brain, you see much longer timescales. So the optimal number of events for these areas here in the medial and lateral parietal cortex is relatively few, whereas the optimal number of events to describe what's going on in early visual cortex is many. As we go deeper into the visual processing systems of the brain, we see longer and longer timescales of representation. The same thing happens as you proceed along the extent of the temporal cortex. These areas in the temporal pole are areas that are strongly associated with semantics, including lexical semantics. People with lesions or brain disease affecting the temporal poles develop semantic dementia. What this suggests is that as you go farther and farther through these hierarchies, the representations become more temporally extended and more robust against the details of the visual input or the rapid changes that result from, say, changing a camera angle or changing a lighting condition. So these are good candidates for representing event content in the way that an event model ought to. But when you think about it this way, one of the things that emerges is that a reasonable way to think about event models is not as having just one kind of content, but as having content that varies with the temporal grain of the event model: the kinds of things that might be represented in fine-grained event models might be a little different from the kinds of things that are represented in coarse-grained event models.

What I want to say here, summarizing this first section, is that there are large phasic responses at event boundaries during event viewing. These seem to be a normal ongoing part of comprehension.


We don't need to give people a special task or special instructions in order to observe these responses. They are most evident in tertiary association areas—multimodal convergence zones that are relatively far, in neural terms, from the sensory surface and that reflect the output of multiple stages of processing. The timescale of event representations increases as you move from visual or auditory processing areas to these convergence zones.

Okay. For comparison now, I'd like to look at the evoked response to event boundaries in narratives. Just as we can compare the behavioral responses to event boundaries in visual events to those in a narrative text, we can do the same thing with the neural responses. Now, I'll return here to the Raymond stories. As you recall, these are descriptions of one boy's activities over the course of a day, as recorded by a team of twelve observers. The way we presented these in the scanner was using rapid serial visual presentation, one word at a time. The timing of the presentation was based on our previous behavioral studies, in which we allowed people to self-pace. In the scanner, we didn't want them to push buttons, so we needed to pace it for them. We arrived at a presentation rate that was designed to accommodate most of the observers that we'd seen in the previous studies. And like most other people studying reading with this single-word method, we varied the presentation rate such that longer words get presented for a little bit longer, and we added a pause at the end of a clause or a sentence. People actually get really comfortable with this and they're happy to do it. In this case they are doing it in the scanner for forty-five minutes, and they show good comprehension as measured by tests of comprehension after the image acquisition.

Participants read four stories from the Raymond narratives in the scanner, and then after scanning we had them segment. In this case we did not record brain activity during the segmentation. And then we can do just what we did for the movies: we can time-lock brain activity to the boundaries during the scanning session. When you do that, these are the areas that you see activated around event boundaries. Again, the hot spots of activity are bilateral in the back of the brain. Here they are much more medial than lateral compared to the movies. But if you drop the threshold, you'd find that for both the movies and the stories, you have both the medial and lateral areas. There's a little bit in the right frontal cortex, which is hidden by this gyrus here. I'll show you a direct comparison in a little bit, but the layout of these responses—where they occur in the brain—is actually quite similar to what we see for movies.
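To make the word-by-word presentation scheme described above concrete, here is a minimal sketch of how one might assign per-word durations, with longer words staying on screen a bit longer and a pause added after clause- or sentence-final punctuation. The specific timing constants are placeholders invented for illustration; they are not the rates used in these studies.

```python
import re

def rsvp_schedule(text, base_ms=250, per_char_ms=15, clause_pause_ms=200):
    """Assign each word a presentation duration: longer words stay up a bit
    longer, and a pause is added after clause- or sentence-final punctuation.
    All timing constants here are illustrative, not the studies' actual values."""
    schedule = []
    for word in text.split():
        duration = base_ms + per_char_ms * len(word.strip(".,;:!?\"'"))
        if re.search(r"[.,;:!?]$", word):
            duration += clause_pause_ms
        schedule.append((word, duration))
    return schedule

for word, ms in rsvp_schedule("At his desk he paused, as if undecided."):
    print(f"{word:<10s} {ms} ms")
```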


And the timecourse that we see is also quite similar. If we look at the evolution of the response relative to the location of the boundary, again we get a peak about five to ten seconds after. And here, like we did for the everyday activity movies, we see a much larger response for coarse-grained boundaries than for fine-grained boundaries. The stories that we used in this study were about fifteen minutes long. There are four separate stories lifted from the larger narrative, and the segmentation that we used came from the observers' own segmentation.

A converging approach was taken by Whitney and colleagues (2009), in which they presented a single extended narrative auditorily. And in this study, they measured responses not to the observers' own segmentation, but to a segmentation that they proposed would characterize the event boundaries in the story, which was derived primarily from shifts in characters, time, location, or action. The analysis that they did was to compare the phasic responses at these shifts to the level of activity at points that were chosen randomly. The idea in this kind of analysis is that if there is a brain response to these shifts, then the signal should be higher at the shift points than at the randomly chosen points. And that's what they observed. These data are depicted in a way I haven't shown you before. These are sagittal slices; they are oriented as if you are slicing through the brain from front to back, and they move from the medial part of the brain out to the lateral part, here in the right hemisphere—going from right in the middle out to the right hemisphere: just to the right of the midline, right at the middle, and just to the left. What you see in these analyses is basically a single focus of activity that spans bilaterally across the back of the brain and corresponds quite nicely to that posterior activity that I showed you in the previous study.
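The shift-versus-random-points logic of that analysis can be sketched in a few lines. The signal, the shift times, the hemodynamic lag, and the permutation test below are all illustrative assumptions, not the procedure Whitney and colleagues actually used.

```python
import numpy as np

rng = np.random.default_rng(2)
signal = rng.normal(size=900)                              # stand-in timecourse (volumes)
shift_volumes = np.array([50, 180, 310, 455, 600, 740])    # hypothetical narrative shifts

def mean_at_points(sig, volumes, lag_volumes=3):
    """Mean signal a few volumes after each point (to allow for hemodynamic lag)."""
    idx = volumes + lag_volumes
    idx = idx[idx < len(sig)]
    return sig[idx].mean()

observed = mean_at_points(signal, shift_volumes)

# Null distribution: the same statistic at randomly chosen control points.
null = np.array([
    mean_at_points(signal, rng.choice(len(signal) - 10, size=len(shift_volumes),
                                      replace=False))
    for _ in range(5000)
])
p_value = (null >= observed).mean()
print(f"Observed = {observed:.3f}, p (vs. random points) = {p_value:.3f}")
```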


To summarize the similarity that you get across stimuli, this is comparing the very first study that I showed you, The Red Balloon study, and the Raymond study. I'm just showing the right hemisphere here in each case. You can see that for all three of them you get activity in the lateral and medial parts of the back of the brain, and then in the right lateral prefrontal cortex. To me, it's quite striking that the responses to narrative event boundaries are similar to those for visual event boundaries, and that for visual event boundaries you see quite similar responses across quite different stimuli. And again, in all cases the data that are of most interest are data from when people are just doing what they're doing when they watch or read one of these stories. In those responses, we see footprints of the ongoing components of comprehension.

In the domain of behavior, we saw that one of the really interesting questions to ask is what the relationship is between narrative changes—changes in features of the situation—and the behavioral responses. We can ask the same set of questions about the neural response. And I think after a short break, let's turn to that. We'll take a little break here.

Over the last couple of lectures, we saw how the behavioral responses to event boundaries could be accounted for in part by the effects of changes in dimensions of the situation that the person is watching or reading about. And it's important and interesting to ask whether the same thing is true in the neural responses, for the same reason. Event segmentation theory says that there's a causal pathway from experiencing a change in the situation to experiencing an event boundary. The change in the situation tends to produce a prediction error. The prediction error induces updating of your event model, and that's the event boundary. If that's the case, then two things follow. First, we should see phasic responses in the neural systems associated with event boundaries at situation changes. And second, those responses ought to mediate or account for part of the responses to the event boundary. If the experience of the event boundary is in part due to the processing of those changes, then we should be able to account for part of that neural response by modeling the effects of the changes. So, I want to describe a set of studies that aim to ask whether this is the case.

Just to restate, event segmentation theory proposes that event boundaries are experienced because of spikes in prediction error. I should emphasize this one point: In naturalistic environments, prediction error spikes tend to happen when more situational features are changing. In principle, it should be possible to devise situations in which you have prediction error spikes without feature changes or feature changes without prediction error spikes. But if we look at the natural world, it tends to be that people make more prediction errors when more things are changing. Thus, the neural correlates of feature changes should mediate the neural correlates of event boundaries. We can look at this by using the coding that we applied to the behavioral data yesterday and looking at the neural response. So, just to remind you: In stimuli like The Red Balloon movie, we can code the points in time in the movie at which there is a change in cause, a change in character, a change in interactions among characters, a change in interactions with objects, a change in goal, or a change in spatial location. And just as we could time-lock event boundaries to those changes, we can time-lock the neural response. And when we do, we get this really dramatic image. For the moment, don't worry at all about which brain areas are associated with which feature changes. Just note that a large proportion of the cortex is responsive to feature changes in this narrative film, and note also these pink regions.
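The causal pathway just restated—feature changes tend to produce prediction-error spikes, and error spikes trigger updating of the event model—can be caricatured in a toy simulation. This is only a sketch of the logic of event segmentation theory; the features, the error function, and the threshold are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy situation: six binary features (cause, character, object, goal, space, time).
# At each timestep some features change; prediction error grows with the number
# of changes (plus noise), and the event model is updated when the error spikes.
n_steps, n_features = 60, 6
changes = rng.random((n_steps, n_features)) < 0.08   # sparse feature changes
threshold = 1.5                                      # arbitrary gating threshold

boundaries = []
for t in range(n_steps):
    prediction_error = changes[t].sum() + 0.3 * rng.normal()
    if prediction_error > threshold:     # error spike -> update the event model
        boundaries.append(t)

print("Timesteps with feature changes:   ",
      np.flatnonzero(changes.any(axis=1)).tolist())
print("Model-updating (boundary) timesteps:", boundaries)
```

In this toy world, boundaries tend to fall where several features change at once, which is the regularity the theory appeals to when it predicts that feature changes should mediate the neural response to boundaries.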


So there are lots of areas that respond not just to a single change but to multiple changes. And if you look at these regions that respond to multiple changes, they tend to correspond with the regions that respond to event boundaries. So, a lot of the brain responds to feature changes in a stimulus like this. If we plot the evoked response to event boundaries in this movie as we did before—remember, this is the one stimulus where you had larger responses to fine-grained boundaries than to coarse-grained boundaries—note also, here in the dashed lines, I've shown you the response to event boundaries after we've removed from the signal the response to changes. If we statistically control for the response to situation changes, what you find is that you cut the response to event boundaries about in half. This is consistent with the theory's proposal that the reason we're experiencing event boundaries has to do with the processing of situation changes.

We can again apply the same coding that we used to look at behavior in the Raymond stories to the analysis of the brain response during reading of the Raymond stories. So here we again coded changes in cause, character, goal, object, space, and time, and we can time-lock to those changes and again ask "Are there areas of the brain that are selectively responsive at those changes?" and "Do those selective responses account for the response to event boundaries?" Again, you get this very complex map of the response to changes. Again, you see large parts of the brain that are responsive to multiple changes. Again, they are in pink here—and remember, for the written narratives the strongest response to event boundaries was in these medial posterior regions, and that's where you see the most pink again. We do the same kind of analysis, where we calculate the response to event boundaries first without accounting for these changes and then after accounting for these changes. And if we do that, we see the same thing as we saw for the movies. So here we have a larger response to coarse-grained boundaries than to fine-grained boundaries. And both of these responses are reduced by about half if we remove the response to the situation changes.

So, there are large effects of shifts in situational dimensions on brain activity. These are similar between visual events and narrative events. Some of these responses are general: some of the brain areas respond to multiple situation changes, while others are specific to a particular kind of change. I just want to mention, for those who were here for the second lecture, that those specific responses, I think, are really informative about the potential embodied nature of event representations in the brain. Those are the responses that I was focusing on in that second lecture.
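Here is a minimal sketch of the statistical-control logic described above: fit regressors for the coded situation changes, take the residual signal, and recompute the boundary-locked response. The data and the simple least-squares step are illustrative assumptions; the published analyses are more elaborate (for instance, change regressors would be convolved with a hemodynamic response function), and with this random toy data the two printed numbers will be similar rather than halved—only the mechanics are shown.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 800
# Stand-in regressors: one column per coded change type (cause, character,
# object, goal, space, time). Purely simulated for illustration.
change_regressors = rng.normal(size=(n, 6))
bold = change_regressors @ rng.normal(size=6) + rng.normal(size=n)

# Ordinary least squares: remove the variance explained by the change regressors.
X = np.column_stack([np.ones(n), change_regressors])
beta, *_ = np.linalg.lstsq(X, bold, rcond=None)
residual = bold - X @ beta

boundary_volumes = np.array([60, 200, 350, 500, 650])   # hypothetical boundaries

def evoked(sig, volumes, lag=4):
    """Mean signal a few volumes after each boundary."""
    return sig[volumes + lag].mean()

print("Boundary response before control:", round(evoked(bold, boundary_volumes), 3))
print("Boundary response after control: ", round(evoked(residual, boundary_volumes), 3))
```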


For both visual and narrative events, the processing of situation changes accounts for about half of the response to event boundaries, as predicted by the theory.

I'm going to shift here a little bit and talk about event model construction. I want to isolate a question that really is a psycholinguistic question, and that is: When we look at the response of the mind and the brain during reading, what is it that's special about reading extended, coherent narrative discourse, as opposed to the mechanisms of language at the sentential and lower levels, which apply whether or not you're reading about a coherent, connected situation? With respect to the neuroanatomy, one early idea that came mostly from looking at neurological patients was that there is something going on in the right hemisphere that is special about discourse processing. And this was a really nice, elegant idea because, as you likely have heard many times, the left hemisphere in most right-handed adults appears to be dominant for language. So if you have a stroke or a brain injury affecting the left hemisphere in the temporal and frontal lobes, you will have aphasia—you'll lose some of your language ability—whereas right hemisphere lesions are much less likely to produce an aphasia. And so the idea was that if the left hemisphere is responsible for things at the sentential and subsentential levels, maybe the right hemisphere is responsible for the discourse level. So if you look at the corresponding regions in the right hemisphere, maybe that's where the discourse-level deficits come from. There were some patient data that seemed consistent with this, and there was one influential early functional MRI study from David Robertson and Morton Gernsbacher (2000) that seemed to be consistent with this. But as more data have accumulated, the picture has changed and gotten more complicated in some ways, but actually a little simpler in others. The meta-analyses of the new imaging data haven't provided much support for the right-hemisphere idea. They have provided support for an alternative account, which suggests that there are areas in the medial parts of the prefrontal cortex that are especially important for event model construction. These are a subset of those tertiary association areas that we saw had long timescales in the movie response data, and they are associated with pretty profound deficits of event understanding in people with lesions.

We were inspired by these meta-analyses to try to do an experimental design that could look at this as cleanly as possible, and we did this using the Raymond narratives. I've shown you a little bit of data from this study in the previous lecture. The way these materials were constructed was by lifting coherent paragraphs out of the Raymond narratives. We chose paragraphs such that each one told a little coherent story, and then we created scrambled paragraphs by lifting individual sentences from the intact paragraphs and scrambling the assignment of sentences to paragraphs.


People read these in the scanner, one word at a time, using the method I showed you before, and then afterwards we gave them a recognition memory test for the sentences, and we gave them a test of comprehension asking them about what happened. Here's an example of an intact paragraph and a scrambled paragraph. I'll just read a little bit of each. This one is just a nice little story.

As Raymond skipped down the aisle toward his desk, he glanced around the room. Whenever his glances met those of the other children, his face lit up in a friendly greeting. At his desk he paused, as if undecided whether to sit down or to find something else to do.

So, a perfectly mundane description of everyday activity. In a scrambled case we might have something like this:

Mrs. Birch called in a pleasant tone, "Raymond, take a bath and then you can go to bed." Raymond noticed this immediately and asked curiously, "Am I four feet high?" He stood and went toward them in a slow, jogging run. Raymond stopped briefly in front of Sherwin's furniture store. He turned the book and tilted it so that Gregory could see.

Clearly, you can't build a coherent situation model out of that second one. In the scanner, participants were told explicitly what kind of paragraph they were going to read. So, before each paragraph, they got a cue that said this is going to be an intact story, or this is going to be scrambled, so that in the scrambled case they knew they weren't going to be able to form a coherent situation model. And then we can look at the responses as people are reading these stories.

First, what I want to show you is just a global map of areas that increased or decreased in their signal while people were reading, compared to intervals in between the paragraphs where they're just staring at a fixation cross. What you see is large increases throughout the left hemisphere language areas and many of their right hemisphere homologues. This corresponds to Broca's area. This is an area where, if you have a brain lesion, you experience a profound aphasia characterized by a poor ability to form grammatical utterances but decent comprehension of the meaning of sentences. Whereas lesions in this area, Wernicke's area, are characterized by a fluent aphasia in which people produce grammatical sentences and are often quite verbose, but have very little semantic content to their utterances and have profound difficulties understanding what people are saying or behaving appropriately in response to words.


So you see increases throughout the classical language regions and in their right hemisphere homologues, and then also in these medial frontal areas. And then you see decreases throughout a set of regions that correspond to what's come to be called the default mode network. This is a set of areas that tend to decrease whenever people engage in focal cognitive activity, and there's been a great deal of interest in recent years in the nature of these regions, their connectivity, and their functional significance. I'm not going to say anything more about the default mode regions today. I want to focus on the activations.

I want to focus in on the timecourse of a few of these areas. The most interesting ones for us, I think, are these bilateral medial frontal areas. Here I'm showing you the timecourse in this left hemisphere region. This is the dorsomedial prefrontal cortex: "dorsal" means it's toward the top, "medial" means it's toward the middle of the brain, and "prefrontal" means the anterior part of the frontal cortex. If you look at the response time-locked to the forty or so seconds while they're reading, you see that there is an initial dip in activity when they focus in on the task, and then, when they're reading an intact story, this area increases substantially in its activity and stays high throughout the duration of the paragraph. It shows essentially no response when you're reading a scrambled paragraph, and then it shows a big phasic decrease when the story ends. This is a really good candidate for something that is doing something special when you're reading intact stories. It's not just responding to reading, and it's doing it consistently throughout the story.

You can contrast that kind of response with the response of something like the right inferior parietal cortex, which also seems to have something to do with story processing here, because there's a big difference between its response to intact stories and scrambled stories. Sorry—I said I wasn't going to say anything more about the default mode network, and this is a default mode network area. But it is one that seems to be actively inhibited throughout the duration of the story. Putting this result together with others, one possibility is the following: this area, you may recall, was one of the areas that was strongly associated with relatively long timescales in the responses to visual narratives. It could be that in order to build a narrative in your mind from text, you have to inhibit the visually driven model building that could otherwise be happening in this region.


And then we can contrast those kinds of responses with responses such as we see here in the inferior frontal gyrus, a lateral frontal region where we see something of a response even when the paragraph is scrambled, and where the response increases systematically throughout the coherent narrative. It has a much larger response for the intact than for the scrambled stories, and it's an increasing response. One possibility is that this region is involved in mapping new information into your event model as it comes along. So when you're having to abandon an event model after each sentence, it never gets very far. But when you're able to build an event model incrementally, maybe it has more to do. And then we can contrast that with regions such as the posterior cingulate cortex, which show large deactivations to both scrambled and intact stories, but a larger one to the intact stories.

Okay. We see multiple flavors of responses. The one that I think is most important is the one in the dorsomedial prefrontal cortex, where the activity comes on dramatically at the beginning of the story, stays high throughout the story, and does nothing at all for the scrambled paragraphs. This is the only region in the brain that shows this response. This response is in no way right-lateralized; we saw no evidence of hemispheric differences in this study. So, event model construction draws on specialized neural mechanisms. One that seems particularly important is the dorsomedial prefrontal cortex. There are others that may be important as well.

This leads to a really interesting question about the relationship between event models constructed from narrative in the brain and event models that are constructed from observing something visually. This is a question that was taken on in a recent study by Zadbood and colleagues (2017), again out of the group headed by Ken Norman and Uri Hasson. This is another really wonderfully clever design. What they did in this experiment was have people watch television dramas and then recall those dramas. And then they had new participants come in and listen to the narratives that were given by these people. You can look at the brain areas responsive while someone is viewing a set of events, while they're remembering them, or while they're listening to someone else describe them. And you can relate all of those responses directly to each other because they're about the same set of events.

Here's the method. In this study people watched one of two TV shows, Sherlock or Merlin. They watch the TV show, then they recall a narrative of the show after some delay. Then, new participants go into the scanner and listen to the narratives over headphones. And because these stories have really well-defined scenes, the researchers can identify corresponding scenes in the movie and in a subsequent narrative. They can look at the pattern of brain activity while someone's watching that scene, telling about it, or hearing someone else tell about it, and compare those three.
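A bare-bones sketch of that comparison—averaging a region's multivoxel pattern within each scene and asking whether corresponding scenes match across conditions—might look like the following. The simulated data, the region size, and the simple correlation index are illustrative assumptions, not the analysis Zadbood and colleagues actually ran.

```python
import numpy as np

rng = np.random.default_rng(5)
n_scenes, n_voxels = 20, 100

# Simulated scene-averaged patterns in one region for two conditions that share
# event-level content (e.g., watching vs. listening), plus condition-specific noise.
shared = rng.normal(size=(n_scenes, n_voxels))
watching = shared + 0.5 * rng.normal(size=(n_scenes, n_voxels))
listening = shared + 0.5 * rng.normal(size=(n_scenes, n_voxels))

def matched_scene_similarity(a, b):
    """Mean correlation of corresponding scenes, minus the mean correlation of
    mismatched scenes (a simple index of scene-specific shared structure)."""
    corr = np.corrcoef(a, b)[:len(a), len(a):]   # scenes of a vs. scenes of b
    matched = np.diag(corr).mean()
    mismatched = (corr.sum() - np.trace(corr)) / (corr.size - len(a))
    return matched - mismatched

print(f"Scene-matched similarity index: {matched_scene_similarity(watching, listening):.3f}")
```

A region whose index is reliably above zero carries scene-specific information that is common to watching and listening, which is the kind of evidence discussed next.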


And so then we can ask: Are there parts of the brain that show similar patterns—similar changes from scene to scene—whether I'm watching it or telling it or hearing about it? What I'm showing here is an analysis in which they asked where in the brain the timecourse of activity is similar between listening to a narrative and watching a movie. This, on the left, is for Merlin, and this, on the right, is for Sherlock. There are a number of areas that show this pattern, but note that one of the very strongest is this area in the dorsomedial prefrontal cortex. So again, here we're looking at the medial surface of the left hemisphere and the right hemisphere; here is the lateral surface of the left hemisphere and the right hemisphere. You see strong responses throughout the brain, strongest of all probably in these medial prefrontal cortex regions. We also see strong common responses in the same regions that were associated with long timescales in the Baldassano (2017) study. These areas that show stable representations over long periods of time also show strong consistency between listening to a story and watching a movie. We can also ask about the similarity between watching the movie and telling the story of the movie. And again—at least for Merlin—you see these strong responses in the medial prefrontal cortex and also throughout these other areas. I don't know why you don't get it for Sherlock in this case. That's an interesting failure of the otherwise consistent pattern.

Okay, so changes in the neural patterns during narrative comprehension track the event structure of the stimulus. The regions that are involved are the same ones that show stability and memory-dependence over longer timescales. And by directly comparing corresponding narrative events in viewing and telling and listening, we can test the hypothesis that what's represented there is not about the details of the visual input or the words or the process of articulation; it's about the events that are being seen or described or listened to.

So, just to wrap up what we've talked about today: The experience of event boundaries is associated with large phasic responses throughout the cortex. These responses are mediated by changes in the situation. And these responses hold across modalities—there is striking convergence between the responses to events that are visually experienced and events that are described. There seem to be specialized neural mechanisms for discourse-level model construction, and the one that is most consistent across the data that I showed you, and across many other data sets that have been characterized in meta-analyses, is the medial prefrontal cortex. I don't want to forget that event models also draw in part on modality-specific representations. We discussed these briefly in the second lecture, and I didn't want to come back and spend more time on them in this lecture because I've taken enough of your time this morning.


But I just don't want us to forget that you don't get to these event models without going through modality-specific representations. Okay. So I'll stop there.

Lecture 6

Prediction in Event Comprehension

We have seen in some of the early lectures the central role that prediction plays in event comprehension—in particular, in the segmentation of events. I have tried to show reasons to think that events are segmented because of spikes in prediction error. Prediction plays a much broader role in perception, in action control, and in language comprehension. What I want to try to do in this lecture is talk about that broader role that prediction plays in perceptual and comprehension processes and then relate that back to the discussion of the segmentation of events. Prediction is a hot topic in psychology these days, from very fast automatic mechanisms in object recognition and scene comprehension, to very rapid mechanisms in motor control, to more extended mechanisms in planning and language comprehension. I think that all of those kinds of predictions have some common computational features.

There is another sense of prediction in English, which refers to things like predicting whether the stock market is going to go up or down or predicting who's going to win the next election. These are deliberative, conscious decisions that we make. That kind of prediction, I think, is very different from prediction in the context of perception and language comprehension and motor control. That's not the kind of prediction I'm going to be talking about. I'm going to be talking about the kinds of mundane predictions that happen every day on a fine timescale. In my lab we call this "everyday clairvoyance," and what we mean is that the nervous system has an ability to anticipate, on many dimensions, what is going to occur in a short period of time.

Let me start with perception. A great place to start is with Helmholtz's Physiological Optics (1910). Helmholtz proposed what has come to be called the likelihood principle, which is: Given a particular visual input, we perceive the object that is most likely to have been caused by the pattern of stimuli that we have received.

All original audio-recordings and other supplementary material, such as any hand-outs and powerpoint presentations for the lecture series, have been made available online and are referenced via unique DOI numbers on the website www.figshare.com. They may be accessed via this QR code and the following dynamic link: https://doi.org/10.6084/m9.figshare.8982455.



This applies not just to visual inputs, but to auditory, haptic—any sensory modality. His idea is that from the available inputs our nervous system makes a rapid, automatic set of predictions or inferences about what is out there in the world. You can see this unfold in real time by looking at people's eye movements as they look around a scene. Suppose that I come upon a scene like this, and initially my eye happens to be fixated here on this pig and this tractor. I get pretty high-resolution visual information about the area where my eyes are fixated, but not much information about objects in the periphery, because we have a foveated eye that gives us high resolution only at the very center of our vision. If I'm looking here (at the center) and there is an object out here (off to the left side), it's basically a blob to my visual system, but I'll probably interpret that blob as something that's contextually appropriate for the scene. I'll use information about the larger scene, like the fact that it is a farm scene, to make a prediction about what this object is. If I were to foveate over here, and it was indeed a chicken, everything would be fine. But if it were, say, a mixer—something that's contextually inappropriate for the farm—then that should be a prediction error. It turns out that when people have to identify the object over here, they are faster and more accurate at identifying objects that are predicted than objects that would not be predicted based on the scene. The chicken is easy to identify in the context of a farm scene, and the blender or the mixer is easy to identify in the context of a kitchen scene. And this effect is largely due to a bias: our expectations based on the scene lead us to favor objects that are contextually appropriate and to work against objects that are not.

You can see this even more quickly if you look at computational processes that happen within a single glance, and one way to do that is with physiological measurement. In this experiment from Moshe Bar and his colleagues (2006), they used magnetoencephalography (MEG), which measures the magnetic signals at the surface of the head and gives you very fine temporal resolution about what's going on in the brain, in contrast to fMRI, as we talked about this morning. They were interested in the relationship between three areas of the brain. Early visual cortex is the first place visual information from the eyes hits the brain. We've known for a long time that information projects forward from the early visual cortex through a series of connections to the forward and bottommost part of the temporal lobes, namely the fusiform gyrus. So, this is an area on the bottom of the brain where you find cells that are highly selective for particular classes of objects. You'll find patches of cells in the fusiform cortex that respond just to faces or just to cars or mostly to houses.


There are visually responsive cells in this part of the temporal lobe that respond to specific classes of objects. In this case, there might be a patch of cells that respond selectively to umbrellas. In the orbitofrontal cortex are cells that are selective for categories however they are presented—whether we get information about an object from vision, from hearing, or from merely thinking about it—and cells of the orbitofrontal cortex are highly sensitive to the context in which objects appear. Bar and his colleagues (2006) hypothesized that information about the broad scene that the object is occurring in—for example, the fact that this is a farm scene rather than a kitchen scene—might be extracted quickly based on low spatial-frequency information by the orbitofrontal cortex and then fed back to guide the fusiform cortex. "LSF" stands for low spatial frequency. What they're trying to illustrate is a hypothesis that early visual cortex projects a low spatial-frequency image up to orbitofrontal cortex at the same time that information is being processed through the visual stream to fusiform cortex. The orbitofrontal cortex works quickly enough to give the fusiform cortex information with which to make a prediction about what kinds of objects are likely to be present in the scene, which would constrain the fusiform's response.

The paradigm that they developed was to present objects briefly, with a visual mask before and after the objects. The objects are on the screen for 26 milliseconds. And before and afterwards, there was a high-contrast image, which makes it so that the visual system cannot get any extra information about the picture after it goes away, because it's blocked by this other information. And then they recorded while people were trying to recognize the objects and give their confidence in their recognition. What they were interested in is the brain response in this period, just as the object is being presented. They're going to look at the difference between trials on which the person correctly recognized the object and trials on which they could not successfully recognize the object. This is just illustrated on an image of the brain viewed from below; here's the fusiform cortex and here's the orbitofrontal cortex. You can see that this is down at the bottom of the temporal lobes and this is at the bottom of the frontal lobes. What they're showing here is that the peak of activity that's related to recognition in the orbitofrontal cortex is quite early, barely more than a tenth of a second after the image is presented, whereas the initial peak of activity differentiating recognized from non-recognized trials in the fusiform is about 50 milliseconds later. And that's consistent with the idea that this early response over the frontal cortex is available to guide the predictions and influence recognition processes in the fusiform. Within a single glance, on very short timescales, parts of the brain are making predictions that can guide the processing of other parts of the brain.


One of the neurophysiological signals that's been used very heavily over a few decades now to look at predictive mechanisms is a component of the electroencephalographic response called the P300. In electroencephalography (or EEG), we're measuring electrical activity at the surface of the scalp. EEG gives us fine-grained temporal information like magnetoencephalography, but not as much information about spatial location. Way back in the sixties, Sutton and his colleagues (1965) reported a very characteristic response that you get if people are given a reason to predict that a stimulus is going to occur and then something different occurs. What they did in these experiments was to present a long train of either lights or sounds. You'd repeatedly see a light flash with a well-characterized frequency, or you'd repeatedly hear a beep play over and over again. And then, once you got used to that, every once in a while, if you were listening to sounds, you might see a light instead of hearing the sound; or if you were seeing a light flash, you might hear a sound instead of the light. And then in some of the trials there was an equal probability that you would hear a sound or see a light. They measured neural activity with EEG and they time-locked the EEG response to the onset of the sound or light.

And this is a case where what you had was two different sounds, two different pitches. Two thirds of the time you had one of the sounds and a third of the time you had the other sound. When you heard the sound that you were less likely to be expecting, because it was rare, you see a bigger deflection of the EEG signal than on the trials when you heard the sound that you were expecting. In EEG, voltages are plotted with negative up and positive down, so this is a positive-going response about 300 milliseconds after the stimulus is presented. Same thing with light: here, in the solid line, you're seeing the response to the light that you were expecting, and in the dashed line the response to the rare light. And you get a bigger positive deflection when you're seeing the light that you weren't expecting. Maybe the best assay of whether this is really reflecting prediction errors comes from an experiment where they presented two stimuli with equal probability, but they asked people before each stimulus to guess which one of the two it was going to be. Here is the case where you had two different sounds with equal probability. People are just guessing whether it's going to be "beep" or "boop". If they guess right, they have a modest response. If they guess wrong, they have a bigger response. Similarly, you see two different lights. If you guess right, you get a modest response. If you guess wrong, there's a bigger response 300 milliseconds after the stimulus comes on. This is very, very fast. It's much quicker than we could vocalize whether we were surprised or not.

So, vision and hearing make predictions based on context. These predictions can bias object recognition. The information to guide these predictions can manifest within the viewing of a single object.
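The oddball logic of the Sutton-style experiments described above can be sketched as a simple epoching-and-averaging exercise: time-lock to each stimulus onset, average within condition, and look at the rare-minus-frequent difference wave. The sampling rate, trial counts, and simulated bump below are assumptions for illustration, not the original data.

```python
import numpy as np

rng = np.random.default_rng(6)
fs = 250                                  # sampling rate in Hz (assumed)
n_trials, epoch_len = 200, int(0.8 * fs)  # 800 ms epochs after stimulus onset
t = np.arange(epoch_len) / fs * 1000      # time in ms

# Simulate single-trial epochs: rare stimuli get an extra positive bump ~300 ms.
is_rare = rng.random(n_trials) < 0.2
p300 = 3.0 * np.exp(-((t - 300) ** 2) / (2 * 60 ** 2))     # Gaussian bump (a.u.)
epochs = rng.normal(size=(n_trials, epoch_len)) + np.outer(is_rare, p300)

# Average within condition and form the rare-minus-frequent difference wave.
erp_rare = epochs[is_rare].mean(axis=0)
erp_frequent = epochs[~is_rare].mean(axis=0)
difference = erp_rare - erp_frequent
print(f"Difference wave peaks at ~{t[np.argmax(difference)]:.0f} ms post-stimulus")
```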


In the Bar study (2006), we see that low spatial-frequency information about an object can be relayed forward to guide predictions while higher spatial-frequency information is being extracted. And prediction violations lead to transient increases in neural activity that we can see with the P300.

In language, there is a host of predictive mechanisms that operate on very fast timescales, ranging from anticipating what the next phoneme is going to be in a speech stream to much longer-range predictions about prosody, turn-taking, and discourse structure. What I want to focus on, because of its relation to event cognition, is discourse structure. These stimuli come from a study by my colleague Mark McDaniel and his colleagues (2001), based on a paradigm developed by Gail McKoon and her colleagues (McKoon & Ratcliff, 2001). In these studies, what you do is present a passage that gives the reader the opportunity to make an inference, and then you can measure the likelihood with which they made the inference by looking at how quickly or how accurately they can respond to the inferred material. In this experiment, if you read this first passage, you might wind up predicting that the protagonist of the story had died. The passage says,

The director and the cameraman were preparing to shoot closeups of the actress on the edge of the roof of the 14th story building, when suddenly the actress fell. The director was talking to the cameraman and did not see what happened.

If you read this, and if you were constructing an event model describing the contents of the narrative, you might infer that the actress had died. You can contrast that with the passage that explicitly says that she died:

The director and the cameraman were preparing to shoot closeups of the actress on the edge of the roof on the 14th story building when suddenly the actress fell and was pronounced dead. The director was talking to the cameraman and did not see what happened.

It says straight up she died. Compare these two to a control condition in which there is no reason whatsoever to think that the actress died:

Suddenly, the director fell over the camera stand, interfering with the cameraman's efforts to shoot closeups of the actress on the 14th floor. By the time the camera equipment was set up again, it was too dark to continue shooting.


In this case the actress is fine. What we're going to do is measure the time to recognize the word dead. The logic underlying the experiment is that if you inferred that the actress was dead, you might, by spreading activation, pre-activate your lexical representation of "dead", which ought to make this word faster to recognize compared to the control condition. What you find is that the control condition gives us an estimate of how quickly people can recognize the word in general. Compared to that baseline, when they read the predictive passage they're quite a bit faster to recognize that word—just as much faster, in this case, as if they'd actually been told that she died and had read the word dead.

Now, it's important for me to mention that the degree to which you see these kinds of effects depends highly on the reading context. The data I just showed you came from a condition in which people were instructed to elaborate on the text, imagining the situation, the actors, and the actions. McDaniel and his colleagues (2001) also ran a condition in which they told people to focus on the precise details of the wording, which ought to discourage them from constructing a coherent event model and might suppress some of these inferential mechanisms. If we compare the situation condition to the text condition, we see again that if you explicitly activate the word "dead" by presenting it in the story, that speeds recognition. But if you present a passage that allows you to infer that the actress died, you see little benefit, if any. So, when people are focusing on the surface structure of the text, they are much less likely to make these kinds of inferences.

In language comprehension, predictive mechanisms integrate information from the language with information from other sources to guide our predictions. This is a study from Gerry Altmann and Yuki Kamide (2007) in which people were presented with scenes that included a bunch of objects. And then they processed sentences that said either "The man will drink the wine", "The man has drunk the wine", "The man will drink the beer", or "The man has drunk the beer". "The man will drink the wine" is inconsistent with the information that you could predict from the picture, because somebody else already drank the wine here. And "The man has drunk the beer" is also inconsistent with information that you might predict from the picture. What you find is that when the noun wine or beer is pronounced, if the phrase is "will drink", then people are more likely to fixate the beer. And if the phrase is "has drunk", then they're more likely to fixate the wine. For the ones that are inconsistent, they are just confused. But for the two possibilities that are consistent, they look to the thing that would be predicted from the verb.


So as you're hearing "will", you're anticipating what the next word is going to be based on that verb complement and also based on the extralinguistic information in the picture. If you hear "will drink", then before you hear what the noun is, your language comprehension system is getting information from this picture that allows you to predict that the noun is going to be beer, whereas if you hear "has drunk", then your language comprehension system is getting information from the picture that allows you to predict that it is going to be wine.

Here's another example where extralinguistic information is integrated with ongoing language comprehension to guide your eyes. This comes from a study by Matt Crocker and his colleagues (2008). They showed these pictures that included three characters, one of whom is the patient of an action—for example, the pirate here is being washed by the princess. Another character is both an agent and a patient—the princess is washing the pirate while being painted by the fencer. And then you've got one character, the fencer, who is only an agent and not a patient. In German, you can describe what the princess and the pirate and the fencer are doing using either a subject-verb-object form or an object-verb-subject form. If you do it in the SVO form, you might say something like "Die Prinzessin wäscht offensichtlich den Pirat", meaning "The princess is apparently washing the pirate". So the princess is the subject and the pirate is the object. Or you might hear something like "Die Prinzessin malt offensichtlich der Fechter". And so this is grammatically ambiguous at the point where you hear "Die Prinzessin", and it's still ambiguous at the point at which you hear "malt". "Die Prinzessin wäscht" could continue as "the princess is washing something" or "the princess is being washed by something", and "Die Prinzessin malt" could continue as "the princess is painting something" or "the princess is being painted by something". You can't quite do this in English. The question is: What do people's language systems think the form is going to be? Do they think it's going to be an SVO or an OVS sentence at the point where it's ambiguous?

What we can look at here is the proportion of looks to the pirate, the princess, or the fencer in each of these cases. If you hear the SVO sentence, what happens is you've got this ambiguous point, and then soon after the onset of the verb, before you get information that grammatically disambiguates whether it's an SVO or an OVS sentence, people look to the appropriate character. They're using information about the verb and combining it with information in the picture to look at the appropriate target. So when they hear something like "The princess washes", they look at the pirate early. If they hear "The princess paints", they look at the fencer early. They are combining information that is in the sentence with information that is in the picture, virtually in real time, to make predictions about what the form of the sentence is going to be.
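The dependent measure in these visual-world studies is the proportion of looks to each character over time. A tiny sketch of that computation is below; the trial counts, bin size, and the boundary of the "anticipatory" window are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy visual-world data: for each trial, which character is fixated in each
# 50 ms bin after verb onset (0 = pirate, 1 = princess, 2 = fencer).
n_trials, n_bins = 40, 24                  # 1.2 s after the verb, in 50 ms bins
fixations = rng.integers(0, 3, size=(n_trials, n_bins))
bin_ms = np.arange(n_bins) * 50

# Anticipatory window: bins after the verb but before the disambiguating noun
# (here arbitrarily taken to end 600 ms after verb onset).
anticipatory = bin_ms < 600

for name, code in [("pirate", 0), ("princess", 1), ("fencer", 2)]:
    looks = (fixations == code).mean(axis=0)       # proportion of trials per bin
    print(f"{name:<9s} anticipatory-window proportion of looks: "
          f"{looks[anticipatory].mean():.2f}")
```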


We can see electrophysiological correlates with EEG of these kinds of predictions, and in particular of violations of these kinds of predictions. There are two electrophysiological signals that I want to focus on: the N400 and the P600. The N400 is a negative evoked response that happens about 400 milliseconds after word onset, hence its name. It is often strongest at sites over right central and posterior scalp, back around here. But you shouldn't draw too strong conclusions about where in the brain these things happen, because the electroencephalographic signal integrates over large swaths of cortex; it is sensitive to some parts of the cortex and not others, depending on the folding and on how close a region is to the surface; and it is all filtered through the skull, so you can have a response in one part of the brain and have it show up at electrodes over a quite different part of the head. The N400 is thought to reflect semantic integration. If you have a prediction about what the semantics are going to be and that fails, then you will have more difficulty integrating, and you can see that as a larger N400. The P600 is a later response. It is positive-going rather than negative-going, and it is generally over posterior electrodes—sometimes on the right, sometimes on the left, sometimes bilateral. It is thought to reflect structural integration in the context of language, often syntactic integration.

Here is the classic original description of the N400. What Kutas and Hillyard (1980) did in this study was present a sentential context and then time-lock the response to the onset of a single visually presented word. So you might read "It was his first day at work", which should be relatively easy to integrate. Or "He spread the warm bread with socks", which ought to produce a big prediction error and a big semantic integration burden. Or "She put on her high heeled shoes", which will be another one that would be easy to integrate—but what's interesting about this one is that the font they used for the last word is a little bigger and bolder than the font for the rest of the words. They're interested in comparing, to the normal presentation, cases where the semantics are anomalous and cases where the physical stimulus is anomalous but the semantics are fine. What you find here is that if we time-lock to the presentation of each of these words, once you get to that last word you find that about 400 milliseconds later you have this big negative-going deviation in the semantic deviation condition, whereas you get quite a different response in the physical deviation condition. The bold font produces a positive-going difference—this is really a P300—whereas the semantic anomaly produces a slightly later, negative-going response, the N400.


Here's the initial demonstration of the P600. Osterhout and colleagues (1992) compared sentences like this one, "The woman struggled to prepare the meal", to sentences like this one, "The woman persuaded to answer the door". In the first sentence you have an intransitive verb followed by "to", and this is perfectly fine syntactically. Here, when you read "The woman persuaded" and you get to this "to", it is unpredicted given the syntactic context, and it's ungrammatical. In English, after you read "the woman persuaded", you're expecting a noun phrase; you're expecting an SVO construction. The idea is that "to" ought to lead to a prediction error about the structure of the sentence. What you see when you time-lock to the word "to" is that, compared to "struggled to" (the syntactically appropriate continuation, dashed line), the syntactically inappropriate continuation (solid line) produces a response that includes a large positive-going deflection around 600 milliseconds.

An important thing about predictions in these language contexts is that readers can revise them if they need to, based on the sentential context. This is a study in which participants read narratives that invited inferences and then gave the reader the opportunity to revise them. It says "Dan was a gypsy who had played flamenco since childhood. Now he is a popular musician who plays all over the world. Today, he is giving a recital of his favorite works". This discourse context invites one to infer that the instrument that he plays is a guitar, because the guitar is the stereotypical flamenco instrument. And then if you read "The concert takes place at the prestigious national concert hall", that gives you no reason to revise that expectation. If you read "His instrument is made of maple wood, with a beautiful curved body", again you'd still be thinking guitar here. But if you read "His instrument is made of maple wood, with a matching bow", at this point you'd have to revise your hypothesis about what the instrument was. You'd maybe infer that it's a violin rather than a guitar. Then we're going to time-lock to the onset of the word "violin" in this sentence, "The public was delighted to hear Dan playing the violin". If you read either of the first two continuations, you ought to think it's a guitar at this point, so this ought to produce a larger N400 response, whereas if you have the opportunity to revise, then at this point you ought to think it's a violin, and so this ought to produce a smaller N400 response. And then they presented a final sentence, and afterwards they asked people about the story to make sure that they were understanding the stories.

We're going to time-lock to the onset of the word "violin". And if what you're doing is making elaborative inferences from this passage about what the instrument is, then you are going to infer "guitar". Then, in the revise condition, you ought to infer "violin" from "matching bow". So when "violin" is presented, it should be less work in the revise condition. But I also want you to think back to the McDaniel (2001) study, in which they manipulated whether people were attending to the meaning of the story and the situation or to the surface structure of the words.
But I also want you to think back to McDaniel's (2001) study, in which they manipulated whether people were attending to the meaning of the story and the situation or to the surface structure of the words. If you're not processing this passage at the level of the discourse and constructing elaborated event models, then you might never draw the inference about what the instrument is in the first place, and it might not make much of a difference which of these conditions you were in. It might not be particularly surprising that it's "violin" if you never drew the inference that it was "guitar" in the first place. So, this is all caveated by "if you're making elaborative inferences". Here is half of their results. What they show is that if you time-lock to "violin" in the no-revise and neutral conditions, compared to the revise condition you get both a P300 and an N400 response: two prediction violation responses in those conditions. But like I said, I'm only showing you half of the data here. It turns out that before they ran the main experiment, they measured participants' working memory span—how much information they could hold in mind while manipulating it. They split the sample into those with high working memory span and those with low working memory span, and their hypothesis was that participants with low working memory span might be like the participants in the text condition in McDaniel's (2001) experiment. They would be too overloaded to be forming elaborated event models, and so they shouldn't be showing these inferences or violations of prediction when those inferences fail. Here's all of the data, and now you can see the legend. These are the people with high working memory span. The people with lower working memory span show the same trends, but to a considerably reduced degree. If you are less likely to be drawing these inferences, then you're less surprised when those inferences are violated. Again, prediction depends on the discourse context and also on the abilities of the reader. Now, the initial interpretation of the N400 was that it reflected something pretty straightforward about the lexical similarity between the target word and the immediately preceding words. The idea was something like: we're activating a semantic field as we process a sentence, and words that are close by in that semantic field are predicted, whereas words that are far away are not. Jos van Berkum and his colleagues (1999) provided strong evidence that at least in some cases these N400 effects are really happening at the level of the event model described by the discourse. To do this, they created stories in which the words were always coherent at the level of the sentence, but they could be semantically unpredicted at the higher level of the discourse. I'm not going to try to pronounce the Dutch, but here is the English translation of their stimuli.
As agreed upon, Jane was to wake her sister and her brother at five o'clock in the morning. But the sister had already washed herself, and the brother had even got dressed. So, if you then read "Jane told the brother that he was exceptionally quick", that's consistent with this setup. But if you read "Jane told the brother that he was exceptionally slow", that's inconsistent with this setup. Both of these are fine at the level of the sentence; it's at the higher level of the situation that there is a problem with the latter one. And what you find is that if you time-lock to "slow", you get a nice N400 right there, even though "slow" is just as related to the immediately preceding words as "quick" is. The idea that I want to put forward here is that at any given moment, you've got a set of situation features in your event model, and you've got a set of linguistic features. And you are always deriving predictions about what you're going to be encountering in the near future. You can predict situational features from situational features, you can predict linguistic features from linguistic features, you can predict linguistic features from situational features, and situational features from linguistic features. The system is using both of these sources of information to make predictions about both of these sources of information. To summarize this first part of the lecture: readers' brains make predictions about upcoming input. These predictions cover both linguistic features and features of the situations described by the language. They can be driven by both linguistic features and by situational features. Violations of predictions lead to phasic changes in brain activity that we can see with measures including the N400 and P600. What I'd like to do is to take a quick break, and then we will link all of this back to the segmentation of events and the construction of event models. What I'd like to do now is to relate these considerations about prediction in extralinguistic event comprehension and in language comprehension back to the experience of events in perception and language. Just to reiterate, event segmentation theory proposes that event models are updated because you experience a spike in prediction error. That proposal leads to a straightforward hypothesis, namely that comprehenders ought to experience event boundaries as points in time at which things are less predictable. In a couple of different studies, we've asked people to just rate explicitly how predictable things are. In other studies, we've measured, using objective means, how well people are able to predict, and the hypothesis is that they ought to be less able to predict at event boundaries.
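As a purely illustrative gloss on that idea (mine, not a model presented in the lecture), imagine the current state as a single vector that concatenates situational and linguistic features, and one learned mapping that projects it onto a prediction of the next state. Because the mapping is not partitioned, either kind of feature can inform predictions about either kind, and the mismatch between predicted and observed input provides the prediction-error signal whose spikes event segmentation theory treats as the trigger for updating. The dimensions, the random stand-in weights, and the error measure are all arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SIT, N_LING = 8, 12                                   # hypothetical feature counts
W = rng.normal(size=(N_SIT + N_LING, N_SIT + N_LING))   # stand-in for a learned mapping

def predict_next(situational, linguistic):
    """Predict the next combined feature vector from the current one.

    A full (unpartitioned) matrix means situational features help predict
    linguistic ones and vice versa, in addition to each kind predicting itself.
    """
    current = np.concatenate([situational, linguistic])
    return W @ current

def prediction_error(predicted, observed):
    """Scalar mismatch between the predicted and the observed next state."""
    return float(np.linalg.norm(predicted - observed))
```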
Let me start with a study that comes not from my lab but from Gabriel Radvansky's lab (2016). You read a passage that starts like this: "Jenny was listening to the radio. She had been stressed all day and was finally unwinding. The DJ was spinning some really great stuff…. She turned up the volume and adjusted the bass level." Then you might read "It was loud enough to rattle her windows"; in that case, there's no shift in the situation. But if you read "The next morning she got up and turned on the radio", that's a huge time shift. In both cases, at this point in the story the radio is on. But in the first case it's still nighttime and we know it's loud; in the next-morning case, it's first thing in the morning, and maybe it's not going to be so loud. In between, they varied the particular material that came before the shift. If you first read "She really liked this song, especially the pounding bass line", then there's no reason to expect that there would be a shift to the end of the evening. However, if you read "Unfortunately, she would have to call it a night soon", that foreshadows the shift. They were interested in whether the shifts would indeed be rated as less predictable than the no-shift condition, and whether that effect would be reduced by the information that foreshadowed the shift. In fact, the shift sentence is rated as less predictable than the no-shift sentence. But if you give the foreshadowing beforehand, that mostly eliminates the effect. So as readers are reading (it's time in this case, but similar predictions would be made for space, cause, characters, goals, and so forth), people are making predictions about what those values are going to be next, and most of the time we're predicting that things are going to stay the same. When we experience a shift, that is unpredicted. But if we get information that foreshadows that shift, then we can reduce that effect. We've also looked at explicit ratings of prediction in the Raymond stories. In this study, what people did was read narratives one clause at a time and just rate how well they could predict that each clause was going to happen based on the clauses that preceded it. We can compare those predictability ratings with the segmentation data that I presented yesterday afternoon. As you'll recall, the narratives were presented for segmentation over headphones, read by a narrator; on a screen, one clause at a time; or on a page, a whole page at a time. Participants segmented in those three modalities. In this separate experiment, they read things one clause at a time and rated how predictable each clause was.
These are the results of a multiple regression analysis in which we asked: how strong is the relationship between event boundaries and predictability ratings in each of the modalities? What this is telling you is that when there is a coarse boundary, people judge the clause to be less predictable—that's the negative sign in the auditory case, in the visual-continuous case, and in the visual-sequential case. And, again, the same is true for fine boundaries. These effects are relatively small; they account for a modest amount of the variance in how predictable the clauses are rated. But you have to remember that there are a lot of other things going on in these stories. This is on top of all the other things like the particular words that are used, the syntactic constructions, and the readers' idiosyncratic expectations about what might happen in the story. On top of all those other things, in the midst of all those other things, event boundaries are experienced as being less predictable. We also see a quite strong relationship such that things that are less predictable are read more slowly: the harder a clause is to predict, the more slowly it's read. So, people slow down when things are less predictable, and they are more likely to experience an event boundary when things are less predictable. Situational shifts render narratives less predictable, and this is associated with the perception of an event boundary and with slowing in reading. I like those measures, and I also like the explicit prediction task that I introduced way back in the first lecture—we're going to revisit it in a moment. But one thing that is problematic about those kinds of measures is that you have to stop the ongoing comprehension processes in order to administer them. It is pretty unnatural to be reading and then be stopped after every clause and asked, "How predictable was that clause?" That unnaturalness could change the way people are processing the text. We would like measures of prediction error that don't interrupt the process that you're trying to study. We have been using two in the lab: one that I want to tell you about depends on eye movements, and the other depends on functional MRI. The eye movement measure works like this: you watch a movie of somebody doing an everyday activity. This is one of the clips that I showed you before, of a woman making breakfast, and what we've done here is slow it down by fifty percent. The pink circle shows where the viewer was looking at any given moment in time, and the yellow boxes show objects that the person is going to reach for. The boxes come on three seconds before her hand is going to reach that object. And what you can see is that the eyes get into the box pretty much every time before the hand does. So, this is a form of rapid and automatic prediction in which people are anticipating where the actions are going to be and looking there before the action gets there. This is a phenomenon that has been characterized by Mary Hayhoe (Hayhoe & Ballard, 2005) and other people, and it's a powerful effect in our visual exploration of dynamic scenes.
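Here is a rough sketch of how that kind of predictive looking can be scored; it illustrates the logic rather than the lab's actual pipeline, and the trial structure and field names are assumptions. For each object-contact instance, step through half-second bins leading up to the contact and ask whether gaze ever fell inside the target object's box during that bin; averaging over instances gives the curves described next.

```python
import numpy as np

BIN_STARTS = np.arange(-3.0, 0.0, 0.5)   # half-second bins, from 3 s before contact

def gaze_in_box(x, y, box):
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def looking_curve(instances):
    """Proportion of object-contact instances with gaze in the target box, per bin.

    Each instance is a dict with (hypothetical) fields:
      'contact_t' : time of hand-object contact, in seconds
      'gaze'      : list of (t, x, y) gaze samples
      'box'       : (left, top, right, bottom) of the to-be-contacted object
    """
    curve = []
    for start in BIN_STARTS:
        hits = 0
        for inst in instances:
            lo = inst["contact_t"] + start
            hi = lo + 0.5
            in_window = [(x, y) for t, x, y in inst["gaze"] if lo <= t < hi]
            if any(gaze_in_box(x, y, inst["box"]) for x, y in in_window):
                hits += 1
        curve.append(hits / len(instances))
    return curve

# Comparing looking_curve(near_boundary_instances) with
# looking_curve(mid_event_instances): a later, steeper rise near boundaries
# is the signature of degraded predictive looking.
```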
We hypothesized that predictive looking would be degraded at event boundaries. So what Michelle Eisenberg and I did in this study was to divide all of these object-contact instances into ones that occurred near an event boundary (either a fine boundary or a coarse boundary) and ones that occurred in the middle of an event. These are two experiments, each with 32 participants, and they used different movies. In the first experiment, there was a movie of a man making copies and putting together a binder, a man sweeping the floor, and an actor changing a car tire. In the second experiment, there were a woman making breakfast, a man preparing for a party, and a man planting house plants. What we can do is plot, in the moments leading up to that object contact and in half-second intervals, what proportion of subjects looked at the target box. By the time the object contact happens, basically everybody has looked at the box. This is a high degree of predictive looking, and predictive eye movements are more and more likely as we approach the object contact. This is showing, for Study 1 and Study 2, that viewers are more and more likely to look in the box as the hand gets close to it. But for both studies, there is an interaction such that this curve is steeper when you're near a fine boundary or a coarse boundary than when you're in the middle of an event. You can see it most clearly here in Study 1, and the pattern is similar in Study 2: viewers are catching up later when the contact is near an event boundary, so they're less able to look predictively. Now what I want to do is revisit that explicit prediction task that I described in the very first lecture and combine it with functional MRI to look at the mechanisms of prediction in event model updating. The hypothesis is that deliberately trying to predict what is going to happen in the near future ought to be impaired around event boundaries, and this ought to be associated with neural mechanisms that are known to be involved in signaling prediction error. The neural system that we've focused on here as a good candidate for doing event model updating is a midbrain system centered on dopamine cells in the substantia nigra and the ventral tegmental area. These are two little bits in the middle of your midbrain that have a high concentration of neurons that signal with dopamine. This is the area that is disrupted in Parkinson's disease, and disruptions to this area produce behavioral problems with predictive processing. The midbrain dopamine system is a great candidate for being a trigger for event model updating because it has broad projections throughout the frontal cortex, both directly and indirectly through structures in the basal ganglia, the caudate nucleus, and the putamen.
It is known to have many cells that are responsive to prediction violation. So we've focused on that area. I'm going to remind you of the paradigm and the behavioral data. The paradigm involves watching movies of an everyday activity; we paused the movie from time to time and asked people to predict what's going to happen in five seconds by selecting one of two pictures. In this case, this picture shows what's going to happen in five seconds, whereas this picture was taken from another movie in which she was waxing her car rather than washing it. What we do in the behavioral studies is stop the movie here, ask them to make the prediction, and then restart the movie. We manipulate whether we stop the movie in the middle of an event or two and a half seconds before an event boundary, such that you have to predict across that event boundary. The hypothesis is that predicting across a boundary ought to be more difficult and ought to activate the midbrain dopamine system. These are the data I showed you back in the first lecture. Prediction is less accurate and slower across event boundaries than within events, and people are less confident about their prediction responses when they're trying to predict across an event boundary. Now we're going to look at the neuroimaging data. This is just illustrating that if you take a slice right through about the middle of my ears, you'd see the substantia nigra and the ventral tegmental area, right there in the middle of the brain. They project up to the basal ganglia, to the caudate and putamen here, and also throughout the prefrontal cortex. For the imaging experiment, again we stop the movie and ask them to predict, and we time-lock the brain response to that moment at which they're trying to predict. The idea is that the prediction error that accumulates as you're trying to figure out which picture is coming, when you don't know, ought to be associated with larger responses in those areas. We also hypothesized that, in general, trying to do this ought to be associated with prediction errors, but that they ought to be larger when you're trying to predict across an event boundary. What you find is that throughout this system (except in the ventral tegmental area, where we did not get consistent signal), in both hemispheres, in the substantia nigra, caudate, and putamen, you get reliable responses when people are trying to predict.
And in the right substantia nigra, and marginally so in the caudate, these responses are larger when you're trying to predict across an event boundary than when you're trying to predict within an event. Converging with the eye movement data, explicit prediction is worse at event boundaries, and this is associated with increased activity in the right substantia nigra. More broadly, what I've tried to argue today is that prediction is a ubiquitous feature of comprehension and action control. Prediction failures are adaptive moments at which to update event models: that is the time at which an event model is no longer serving you well and you'd want to update it. I want to end with one little caveat, which is that in many natural comprehension situations it can be really challenging to distinguish between predictive, forward-looking processes and backward-looking processes. One reason is that in natural environments and in language, prediction failures co-vary with other things that are happening. As we saw, prediction failures tend to happen when lots of things are changing. So if you see a neural response, is that neural response really reflecting the prediction error, or is it reflecting the fact that things are changing? In the semantic priming literature, there is a great debate about whether seeing one word predictively activates related words, thereby making it faster to recognize related words, or whether we're faster to recognize related words because when we see the second word, it fits better with the stuff that came before it. In other words, is the system predicting forward, or is it looking back and checking the fit with the stuff that came before? The predictive looking task and this explicit prediction task are, I think, the strongest paradigms we've been able to come up with so far for testing this in the domain of event comprehension. And I think that much of the language data that I showed you converges in providing strong evidence that these are not merely matters of retrospective checking of fit, but really are predictive. I have tried to select paradigms that I think make the strongest case for truly predictive processing, but I do want to note that caveat as something we need to keep in mind when considering these things. To end on a positive note, I want to reiterate that prediction is a ubiquitous feature of living our lives, and we would not be effective animals if we were not able to anticipate what's coming at us and respond to it before we lose the opportunity to take advantage of positive things and to avoid negative things. So I'll end there and take questions.

Lecture 7

Updating Event Models

All original audio-recordings and other supplementary material, such as any hand-outs and powerpoint presentations for the lecture series, have been made available online and are referenced via unique DOI numbers on the website www.figshare.com. They may be accessed via this QR code and the following dynamic link: https://doi.org/10.6084/m9.figshare.8982464.

Thank you so much! Thank you! It's an honor to be here. Today I'd like to talk about how we regulate the contents of our minds over time: how we maintain a representation of what's happening around us and update it as the situation changes. This is a form of memory, and it's one of many forms of memory that exist in the human nervous system. My colleagues in psychology have many terms for memory. Here are just a few of them: we talk about episodic memory, semantic memory, short-term memory, working memory, procedural memory, declarative memory, implicit memory, explicit memory, priming, learning, and knowledge. I should have included long-term memory on the slide; that's a grave omission. The two that I'm going to focus on today are usually described as working memory and episodic memory. But I think that all of these terms fit—to varying degrees—poorly with the actual facts on the ground. I think that our theories need to look at the notion of memory afresh and take into consideration the functional properties of the systems that we're actually studying. When doing so, there's a fundamental distinction that I think underlies the division of labor amongst memory systems. That's the distinction between representations that the nervous system maintains in virtue of activation-based recurrent neural firing and representations that it maintains in virtue of permanent changes to the synaptic structure of the brain. Activation-based memory systems depend on physiological effort to maintain a representation over time. They persist only as long as they are protected from disruption. They have limited capacity. They're fast. They're nondestructive: maintaining something in an activation-based representation system generally does not interfere with other learning that we've acquired over the course of our lives. A couple of great examples of activation-based memory: if a visual stimulus is presented to you and then removed, you can report a lot of the information about that stimulus for a brief period of time.
But it rapidly fades, and much of the detail is lost over the course of a few seconds. If that stimulus is masked by another visual stimulus, it will be obliterated pretty much instantly. Another great example is the kind of memory that we use if we're trying to remember an email address or a telephone number and hold it in mind while we walk to the computer or walk to the phone. (Though I don't think anybody has to remember phone numbers anymore.) Weight-based memory depends on permanent synaptic changes to the nervous system. If we store something in weight-based memory, it's there pretty much forever. It has huge capacity. In fact, some neuroscientists and psychologists have argued that the limitation on the capacity of weight-based memory is not the number of items that can be stored in the brain, but simply the rate at which you can cram stuff in there. Weight-based memory is often slower; it often requires repeated experience to build weight-based memories. However, there are some mechanisms—particularly long-term potentiation—that can be quite fast. One of the key features of weight-based memory is that new representations in weight-based memory systems can disrupt the storage of previously stored information, so they can interfere. These considerations pose a challenge for memory systems in an everyday environment like this one, with a few dozen people, a bunch of furniture, windows, a spatiotemporal framework, and a set of events unfolding dynamically over time. There's a lot to keep track of, and activation-based storage is expensive in terms of the hardware demands—the number of neurons needed per unit of information stored—and in terms of the metabolic demands: activation-based memory requires expending energy on an ongoing basis to keep something activated, as opposed to synaptic storage, for which, once the synaptic changes are made, not much ongoing metabolic activity is required to sustain them. So, activation-based storage is expensive. Our everyday environments are complex. They have a large number of entities and relations that could be relevant to guiding our actions at any moment. And we somehow represent this large, complex situation with limited activation-based storage, without expending too many resources and without using up the limited capacity of our hardware. One proposal that psychologists have made is that activation-based memory systems can leverage weight-based semantic memory systems by using the limited-capacity system to essentially maintain pointers into semantic memory representations that have been learned over time. One example of a theory like this is Ericsson and Kintsch's (1995) long-term working memory account. In Ericsson and Kintsch's account you have a current environment, and then you have a set of long-term representations: experience, knowledge, and beliefs that you have stored over time.
What working memory does is, based on the current state of the environment and the current state of my knowledge, retrieve relevant information from long-term memory and maintain a set of links to the stuff that is currently relevant in this much larger-capacity store. Then this is illustrating how, over time, that activated working memory is going to lead to actions that affect our environment, lead to the storage of new memories, and lead to updating as the situation evolves in time, which is going to change the state of the environment and also change the state of our working memory. A related proposal comes from Alan Baddeley (2000), the author of maybe the most influential theory of working memory in cognitive psychology. Baddeley's initial model of working memory proposed that the structure of our memory for things that are immediately available includes two storage systems: the phonological loop and the visuospatial sketch pad. He called these slave systems to indicate that they're relatively dumb systems. The phonological loop is a system for storing things in an auditory format. It's the kind of system that we depend on heavily when we need to hold onto that email address in our minds. It can cycle through a brief window of auditory input and maintain it in an auditory or phonological code for some period of time. But it is easily disrupted by competing auditory or phonological information. For example, if I give you an email address to maintain and then ask you to just start uttering the syllable "ba", that will very much disrupt this memory system. The visuospatial sketch pad is a representation in a visuospatial format. It allows us to maintain mental images, which we can then manipulate. It is easily disrupted by competing visual information: if I show you a new picture, that will disrupt the visuospatial sketch pad. In the original theory, there is a central executive module that is responsible for loading information in and out of these two slave systems, and the slave systems can also interface with our long-term memory, but in very limited ways. The central executive in the original model is basically just responsible for saying when to update the phonological loop, when to update the visuospatial sketch pad, and how to get information into those systems. But Baddeley, in his 2000 paper, observed that there are some real limitations to what that system can do that don't seem to account for some of the capabilities that people have. For example, people with severe brain lesions to structures in the higher auditory areas, lesions that radically disrupt their phonological loops, can still often have pretty good comprehension of narrative language. Healthy people can remember meaningful strings of text that go on to the length of paragraphs or so, whereas the capacity limitation of the phonological loop is really just a few seconds' worth of information. How are we going to account for the fact that people, in their language comprehension and in other domains, seem to be able to maintain more information in working memory than these slave systems can account for?
What he proposed is a third system, which is not quite a slave system; it's another storage system that he called the episodic buffer. The idea of the episodic buffer is that it is a structured representation of the current situation, one that stretches the limited capacity of activation-based storage by depending heavily on our knowledge about how objects work, how spatial arrangements in an environment work, and how sequences of events typically unfold. Now, this may start—for those who have been at the earlier lectures—to sound very much like a representational format that we've encountered before, namely the event model. Just to recapitulate, event models are structured representations of events. Like mental images, they are in part isomorphic to the content that they represent. They depend on neural sensorimotor maps, and in this sense they are embodied representations (this is something that we talked about a bit in Lecture Four and also in Lecture One). However, like propositional representations, they are componential: they have parts that can be rearranged. Unlike an image or a picture, they have meaningful components that can be rearranged and that correspond to the components of the events being represented. One particularly important set of event models are what Gabriel Radvansky and I (2014) call working models. These are representations of our immediate environment: what's happening now. These are the representations that allow us to keep track of what's happening in front of us, including the things that we don't happen to be looking at, the things that are occluded, or the things that are outside of our central vision. The trick about maintaining a useful working model is that working models need to be updated quickly and efficiently. That is an appropriate property for activation-based representations, which can be rapidly updated. However, working models also need more capacity than solely activation-based representations can provide, and that's more appropriate for weight-based representations. In the third lecture, I presented one theory of how this updating can work—namely Event Segmentation Theory, which proposes that we strike this balance by updating event models when we get evidence, in the form of prediction errors, that our current event model has gone bad. To summarize, event models depend critically on activation-based maintenance. However, they also leverage long-term memory to dramatically increase capacity. This can include semantic memory—knowledge about how the world typically goes, particularly in the form of event schemas and scripts, two kinds of knowledge structures that we described in the first lecture—and also episodic memory—my memory for what happened in my recent past. Some of this knowledge is embodied: it's represented in the same representational medium, the same manifold, as our sensory and motor experience.
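To make the updating claim concrete, here is a toy sketch of the gating idea: a working event model is maintained and buffered against moment-to-moment input, and it is rebuilt only when prediction error spikes relative to its recent baseline. The specific spike criterion, the window size, and the way the new model is seeded are illustrative assumptions, not commitments of Event Segmentation Theory.

```python
from collections import deque

class WorkingModelGate:
    """Toy prediction-error gate for updating a working event model."""

    def __init__(self, window=20, spike_factor=2.0):
        self.errors = deque(maxlen=window)   # recent prediction errors
        self.spike_factor = spike_factor
        self.model = {}                      # activation-based working model
        self.boundaries = []                 # moments experienced as event boundaries

    def step(self, t, current_input, prediction_error):
        """Process one moment of input; return True if an event boundary occurred."""
        baseline = (sum(self.errors) / len(self.errors)) if self.errors else prediction_error
        spike = prediction_error > self.spike_factor * baseline
        self.errors.append(prediction_error)
        if spike:
            # Error spike: the current model has gone bad, so rebuild it from
            # scratch, seeded by the current input (and, in the full theory, by
            # schemas and scripts retrieved from long-term memory).
            self.model = dict(current_input)
            self.boundaries.append(t)
        # Otherwise: keep maintaining the current model, shielding it from the
        # incoming stream rather than rewriting it moment by moment.
        return spike
```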
What I want to do in this next section is describe some empirical evidence that shows how people update their working models in virtual and real-world environments, and then later in the talk we'll get to some work on updating in narrative representations. I first want to start with a kind of intuition pump, so tell me if this has ever happened to you. You realize that you need something. This happens to me all the time: I'm sitting in my office and I realize I need something from the lab down the hall. I walk down the hall and go into the lab. Usually someone grabs me right as I walk through the door and asks me ten questions. By the time that has happened, I have no idea what I came down to the lab to do. And then I usually walk back to my office and I realize, "Oh yeah, that was the thing!" and then I do the whole thing all over again. Hopefully, by the second or third try I actually accomplish the task that I initially went down the hall to do. This is a little cartoon that illustrates the same idea. The guy in the first frame can't remember why he walked out of that room. So he goes back into the room, and then he remembers why he walked out of the room. Gabriel Radvansky and his colleagues (2006), in a series of experiments, have been exploring this phenomenon in a lot of detail. They call these the "walking through doorways causes forgetting" studies. In the initial studies, what they did was build a virtual environment in which people could navigate. They had a little virtual backpack. They would walk up to a table, put down whatever was in their backpack on one side, and then pick up whatever was on the table. Then they would either walk across a room to another table, or they would walk an equivalent distance through a doorway to another table. At each table, they put down the object they're carrying and pick up a new object. The successive tables are either in the same room or separated by a doorway, with equivalent distance and travel time. Then from time to time they are probed: the experimenter asks them, "Is this in your backpack, or is this an object that you just put down?" The hypothesis is that if they are probed with an object that they currently have in their backpack, that object ought to be represented in their event model. If they then walked through a doorway, with some probability that information might not make it into the new event model that they construct when they update their models. So, the hypothesis is that when they walk through the doorway, they're going to update their event models, and at that point they may no longer have that information about what was in their backpack. The information about the object that they just put down shouldn't be in their event model anymore in either case. So, the hypothesis is that—particularly for the thing that's in the backpack, the thing you just picked up and are carrying—memory ought to be weaker after walking through a doorway.
This is what they find. For objects that they are carrying, people are faster to respond to probes if they walked across the room with no doorway than if they walked an equivalent distance through a doorway. For objects that were just put down, there is no effect. For all the other objects—the ones that they are supposed to say no to—there is also no effect. This effect also works for verbal materials. In a subsequent experiment, they had people either pick up an object or memorize a pair of words, with the instruction to be able to recall the second word in response to the first word. Then, after walking either across the room or through a doorway, they were tested for either the most recent object that they picked up or for a word pair. In this experiment there were no items that they had just put down. What you find is that, again, the thing you just picked up and are carrying with you is less accessible in your memory if you walk through a doorway. For the word pairs, the effect is even bigger, so this arbitrary pair of words that you just memorized is a bit harder to retrieve if you just walked through a doorway. You might worry that there is something funny about this virtual reality test, that it doesn't really correspond to how we walk around in the real world. To address that, Radvansky and his colleagues (2011) did a series of experiments in their laboratory. This is a diagram of the laboratory at the University of Notre Dame, and what they did was set up a bunch of tables in the lab such that people could walk from one table to another either within the same room or through a doorway. On each trial, the subject approaches a table with an inverted box; the box is upside down on top of the table. They pick up the box, revealing six simple colored geometric objects. They put the objects in the box, and then they walk to the next table and put the box down. Then they perform two minutes of arithmetic problems to encourage forgetting. (For these real-world objects, if you don't impose an intermediate task to fill up working memory, people don't forget.) Then they get a recognition memory test for each of the six items. The experimenter asks them, "Was there a red cube?", "Was there a blue sphere?" Maybe there was a red cube but not a blue sphere, and you have to answer which items were there. Then they uncover a new set of items and continue. What you find is that people are more likely to forget that they are carrying an item in their box if they just walked through a doorway, and they are also more likely to falsely recognize items that were not in the box if they just walked through a doorway. So, walking through doorways causes forgetting. It works on computer screens, it works in VR, and it works in the real world. It works for verbal materials and for physical objects.
In my lab, we've been interested in exploring the updating of working models in domains that are closer to naturalistic unfolding events, and one domain that's been really helpful for this problem is commercial narrative film. In one set of experiments conducted by Khena Swallow (2009, 2011) when she was in the lab, we showed people sequences from commercial films from around the world. This is a Jacques Tati movie, a French movie from the sixties. It's about Paris after the war, and it has minimal dialogue, which is nice for us. It has lots of objects coming and going, and it has transitions like this one, where he walks through that door. All of our participants tell us that's an event boundary. If they are right, then you might expect that viewers would update their models at that point. So when I ask you which of those objects was just on the screen, you might have a little trouble. Let's do a quick poll. How many would vote that it was the cat? Okay. How many would vote that it was the chair? Oh, you guys were probably below chance. It was the chair. This is typical; people are really bad at this. We are mean experimenters, so we restart the movie to show them that the chair was right there in the middle of their visual field. The hypothesis is that seeing him walk through that doorway caused you to update your event model and rendered the chair less accessible in your memory. Now, the design of this experiment is pretty straightforward. People watched a large number of these clips from commercial cinema, with object recognition tests interspersed. This is really important: we always wait exactly five seconds after the object goes off the screen to present the memory test. The lag between when the object went away and the test is always exactly five seconds. We also recorded where people were looking in a number of these experiments, to know whether they were fixating some of the objects that we were testing more than others; none of the effects that I'm going to show you are due to differences in gaze position. Then we evaluated the effects of event segmentation using boundaries identified by independent observers. A separate group of participants watched all the clips and told us where the event boundaries were. That's how we knew that walking through the doorway was an event boundary. The design has two factors that are factorially manipulated. In this example, imagine time going by like this: the chair comes on the screen at this point and then goes off the screen at this point; that's all we're indicating here. Then five seconds later, we test memory for the chair, and we're indicating where the event boundaries were with these red lines. So, the kind of trial that you saw was actually this kind of trial.
According to our observers, there was no event boundary while the chair was on the screen. Then there's an event boundary just as the chair goes off the screen, and then we test memory. The hypothesis is that memory in this situation ought to be really bad. Memory in this situation ought to be really good, for one thing because the chair was still on the screen during the same event, so it's unlikely that I updated my event model between when it went off the screen and when we test. Similarly, in this situation, the chair is on the screen in the same event in which I test. The difference between these two is that in this one, the object was on the screen during a previous event boundary. We are interested in testing the hypothesis that when you update your working model, the processing that's required to load information into a new working model also facilitates encoding into synaptic-based long-term storage. This kind of object ought to get two kinds of benefits: first of all, it's still in your event model; and second of all, it benefited from that extra processing at the event boundary. Now, this kind of object is really interesting, because if you updated your event model it's no longer going to be in your event model, but if it benefited from the extra processing happening at the event boundary, then it might be present in your synaptic-based memory and it might still be available under the right circumstances. We predict that memory in these two situations would be great, memory in this situation would be bad, and this last one is a really interesting case. Here is what you find. If you measure the proportion of items that people correctly responded to, what you find is that as long as we test within the same event, memory is great. If we cross an event boundary and the object was not encoded at a previous event boundary—the condition that you were in—you're guessing; performance is not significantly different from chance. But if we cross an event boundary and the object was previously encoded during a previous boundary, memory is still great. In fact, you'll notice it's a little bit higher here. This difference is small, but it turns out to replicate and to be statistically significant, and that's interesting. So, this suggests that processes that happen during event model updating also facilitate encoding into long-term memory. What are those processes? One hypothesis is that the medial temporal lobe memory system is particularly important for making rapid synaptic changes, via long-term potentiation, that would enable this strong performance. To test this, we ran a functional MRI study in which we measured activity in the medial temporal lobe memory system as people attempted to retrieve objects either within an event or across an event boundary. This is showing you a cutaway view of the brain, as if the person were standing at about this orientation, highlighting the hippocampus right down in the middle, toward the bottom of the brain.
This is a coronal slice going this way through the brain, illustrating the hippocampus in the two hemispheres and the adjacent parahippocampal gyrus. Both of these structures are known to be important for rapidly forming new episodic memories. So we hypothesized that activity in the hippocampus, and perhaps in the parahippocampal gyrus, would be stronger in that condition in which people are reaching back across an event boundary but are still able to successfully grab information about what happened. And that's exactly what we found. This was the condition where you've crossed a new event boundary but you encountered the object during a previous event boundary. You can see that in none of the other conditions do we activate the hippocampus, but in this condition we activate it robustly. This is true both in the left hippocampus and in the right hippocampus. We see a similar interaction in the left and right parahippocampal gyri, where that condition is the most strongly activated; in those areas we also see some activity for the other conditions, just not as much. What I want to conclude from that paradigm is that what counts as being the past, the boundary between our conscious present and everything else, is set by the updating of a working model. And what allows us to reach back into the past and retrieve something across an event boundary may be particularly dependent on the medial temporal lobes. Before we turn to updating in narrative comprehension, I think this is a good time for a short break. I want to turn now from studies of updating in the real world or in movies to studies of updating in narrative comprehension. Just a quick reminder: in Lecture Four we saw that when someone reads a narrative in which there's a situational change, with some probability they identify that as an event boundary, where one meaningful unit of activity has ended and another has begun. And if they're reading, they also tend to slow down at those points. I've suggested that the reason they're slowing down is that, in response to that event boundary, they're updating their current event model. So what are the consequences of that for memory access? The obvious implication is that if that slowing in reading reflects the updating of a working model, then if I ask you to retrieve information from the previous working model, it ought to be more difficult after you have updated. One paradigm to test this was developed by Daniel Morrow and Gordon Bower and their colleagues (1989). In Lecture Four we saw a study based on this paradigm. This is a slightly different map setup, but it's the same basic idea. Participants in these experiments memorize a map like this until they can identify from memory the locations of all the objects on the map. Then we're interested in what happens when people read stories that include sentences like "Wilbur walked from the reception room into the lounge".
Morrow and his colleagues proposed that what happens when you read that is that you form a succession of working models, beginning with a working model in which Wilbur is in the reception room and ending with a working model in which Wilbur is in the lounge. Interestingly, they proposed that under these circumstances people really take the spatial structure of the environment seriously, so they actually form three working models: one in which he's in the reception room, one in which he's in the experiment room on his way to the lounge, and then one in which he's in the lounge. This is kind of striking, because the experiment room was never mentioned in the text. If you were then probed for the location of an object in the experiment room, you ought to be better at recognizing that object, because you had just formed a working model in which that room was the spatiotemporal framework, compared to reading, say, "Wilbur walked from the reception room into the conference room". If you have just updated your working model and he's no longer in the experiment room, then objects in this room ought to be more accessible than objects in that room. This is what you find. The goal room corresponds to the lounge, the path room corresponds to the experiment room, and the source room corresponds to the reception room; compared to all the other rooms, these three rooms are faster. Note that the path room, which is never mentioned in the sentence, is quite a bit faster than the other rooms, and in fact even faster than the source room, which was mentioned in the sentence. This is kind of striking evidence that readers are forming the succession of event models. One thing that they got interested in is: what is the relationship between shifts in the temporal dimension and the passage of time in the real world with respect to updating working models? To test this, Mike Rinck and Gordon Bower (2000) designed narratives like this one. In this experiment, again, participants memorized the map and then read stories that took place in the environment depicted by the map. This one says: Calvin was one of the janitors at the research center. Tonight he slowly changed into his work clothes in the washroom. He didn't like the job much, but he had to keep it because he needed the money to stay in architecture school. When he opened his locker he noticed a note: Director of center has misplaced top secret report, must be found immediately! He would have to make a thorough search of the center during his shift. He went into the repair shop, but he couldn't see any papers there. So he walked from the repair shop into the experiment room.
Some of the participants would see either two or five intervening sentences, so they might go straight from this sentence to the final sentence, or they might read these intervening sentences: This room was a big mess, and Calvin would have to clean it up before he could go on. Looking around, Calvin thought that someone must have had a party in here. He saw empty pizza boxes, coke cans, bottles of beer, and bits of popcorn everywhere. There was also a puddle of beer on the floor because someone had dropped a bottle. Then they read the final sentence, either "After ten minutes Calvin was finally done cleaning up the room" or "After two hours Calvin was finally done cleaning up the room". We can contrast here two senses of time: narrative time is manipulated by just changing these two words, either ten minutes or two hours; real time is manipulated by the number of intervening sentences. Note that after reading five of these intervening sentences, your phonological loop—the slave system in the Baddeley memory model—ought to be completely over capacity. It is unlikely that anything from before those sentences would still be present in that system. Then, we can probe your memory and ask, "Is the clock in the experiment room?" We haven't changed location at all. All we've done is shift time in the narrative or shift time in the real world. Here's what you find. Compared to a control condition in which there are no intervening sentences, if there are just two intervening sentences, people are a bit slower; if there are five intervening sentences, they're slower yet. But note that for both the two-sentence and the five-sentence conditions, there's as big an effect of changing from ten minutes to two hours as there is of adding five extra sentences. That's a pretty dramatic effect of updating the narrative situation on memory accessibility. Before we leave the map experiments, I just want to remind everyone of something we saw in Lecture Four, which is that all of these effects of spatial shifts on updating in narrative are dependent on the reader's goals. If these experiments were to be repeated without making people study the maps, I'll bet you would not see these kinds of effects. Space is an interesting dimension because the extent to which people track it is highly context dependent. With that in mind, let me return to an experimental paradigm that we talked about in that same lecture. You may recall this story from Rolf Zwaan's (1996) paper about a gallery opening. It says:
Today was the grand opening of Maurice's new art gallery. He had invited everybody in town who is important in the arts. Everybody who had been invited had said that they would come. It seemed like the opening would be a big success. At seven o'clock, the first guests arrived. Maurice was in an excellent mood. He was shaking hands and beaming. Either a moment later, an hour later, or a day later, he turned very pale. He had completely forgotten to invite the local art critic. And sure enough, the opening was very negatively reviewed in the weekend edition of the local newspaper. Maurice decided to take some Advil and stay in bed the whole day. What we saw in Lecture Four was that, compared to "a moment later", "an hour later" or "a day later" is read much more slowly. I suggested that what was happening at that point was that people were updating their working models. If that is the case, then if we probe their memory for a word that was presented just before that shift, retrieval of that information ought to be impaired. That's what you see: retrieval of beaming after "a moment later" is quite a bit faster than retrieval of beaming after "an hour later" or "a day later". That is consistent with the idea that at this point they updated their working model and the information is no longer as available. One more example is from our laboratory. I also showed you this narrative, designed by Nicole Speer (2005) when she was in the lab. This is a story about Mary going camping. It says "she had just bought a new camera, and she hoped the pictures would turn out well. She could hear water running, and figured there must be a creek nearby". Creek is going to be the critical item that we're going to test. And like the Rinck and Bower (2000) study, we manipulated the amount of intervening information presented before a shift sentence: we could present zero, one, two, or three sentences in between. Again, by the time you've read three intervening sentences, it is unlikely that creek would still be in your phonological loop. Then you read either "A moment later she was collecting wood for a fire" or "An hour later she was collecting wood for a fire". In the first experiment, we presented the word creek immediately after this sentence or after other intervening sentences and asked people to verify whether it had been presented. On filler items, we presented words that had not been presented, and people had to recognize that they were not there. Here is what you find. If you present creek immediately after the shift sentence, memory is quite good; but if there is an intervening filler sentence, memory is less good. And it's especially less good if it's an hour later. So merely changing the word from moment to hour causes about seven or eight percent of our subjects in this situation to forget that they just read the word creek.
I am always looking for paradigms that allow me to assess cognitive function without interrupting that cognitive function. The problem with the recognition memory paradigm, in this study and the others that I've shown you, is that you have to stop people's reading and ask them what they saw. An alternative way of testing memory is using anaphoric reference. If we have a definite-article anaphoric reference like "she heard a noise near the stream", there's good evidence that people, in order to identify the referent, retrieve the antecedent from memory. And if it's the case that people are updating their event models at this point, that retrieval ought to be impaired. That's what you see. You see the effect most strongly in the case where there are no intervening items: readers are slower to retrieve that anaphor, stream, if it's an hour later rather than a moment later. These studies converge, to the point of repetitiveness, in indicating that after narrative shifts, in particular shifts in space and time, information is less accessible in readers' memory. Recently, we have been very interested in picking apart more carefully the kind of updating that is going on; this has come up in the question and answer periods of some of the last few lectures. There is a really important distinction to be made between two kinds of working model updating. One thing that you can do as you encounter a change in the situation that you're experiencing is a global reboot of your event model. Event segmentation theory proposes that when you encounter a change in the environment, it leads to a prediction error, which induces you to reboot your event model. We experience that as a break in the subjective present, as an event boundary. We just rebuild our event models from scratch. Some information may be carried over from the previous model, but on this account, information that has not changed is just as likely to be updated as information that has changed. You can contrast this with the kind of updating that's proposed by models like the event indexing model. In the event indexing model, as you read that a dimension of the situation has changed, you update that dimension, and you do this continuously from clause to clause. I know of at least one theory that proposes both kinds of updating: Morton Gernsbacher's (1990) structure building theory posits that as we read a narrative, we map new information into our current event representation, but from time to time we have to shift to build a new structure in memory. It includes both kinds of updating. Surprisingly, when we looked at this, we could find no evidence in the literature as to what people actually do. Do they do incremental updating? Do they do global updating? Do they do both? Heather Bailey and Christopher Kurby and others (2017) in the lab have started working on this problem. Here is one paradigm.
I've shown you these narratives before, to look at the effects of shifts on event boundaries, and they allow us, in a really nice way, to look at the relationship between incremental and global updating. Just to remind you, these are narratives that include shifts in spatial information and in characters. But now I've added something: we can insert into the narratives memory probes for either spatial or character information. Suppose you read something like "Jim picked up his keys from the basket by the front door and paused", and then read "The basket was supposed to be a place for just keys, but his keys were always buried under everything else in there. Jim hated how it became a place to keep junk. From now on he would have to keep it clean, he vowed", and then read "He found his keys and walked into the garage". You've shifted the spatial location. Then, we can probe information about the previous spatial location, "by the front door", and ask people, "Was this phrase presented in the text?" Later in this story we might have something like "On the top shelf in the corner, Jim saw the box his wife had conveniently labeled 'camping gear'". Then we might read some fillers and then read "Walking into the garage, Kathy laughed at the pile of stuff surrounding her husband". In this case we've introduced a new character into the mix, but now we're going to again probe spatial information from before that shift. If we're updating information incrementally—updating just the information that changes—as we read, then we should update spatial information at the spatial shift, and that information ought to be less accessible. If we are updating globally, it also ought to be less accessible. But in this second case, if we're updating incrementally, the spatial information hasn't changed (we've shifted a character, not the spatial location), so that information ought to be fine; it ought to be just as accessible. Whereas if we're updating globally, updating all the dimensions of the situation at the shift in response to an event boundary, then that information would be less accessible. What people did in these studies was read the narratives for comprehension and respond to the interspersed memory probes, and recall that we had three groups of readers. One group was just told to read for comprehension, another group was additionally told that they were going to write a character profile, and the third group was told that they were going to have to draw a map. The map group ought to be focused on the spatial information, and the character-profile group ought to be focused on the characters. Just as a reminder from Lecture Four: compared to the no-shift condition, if we shift—introduce a new character or change spatial location—people identify those points as event boundaries. But they are more likely to identify an event boundary in response to a spatial shift if they're attending to space than if they're attending to characters. This is evidence, again, that the degree to which people track space while they're reading depends on their goals. But the key question here is: when they hit these event boundaries, did they incrementally update just the dimension that changed, or did they globally update the dimensions that remained the same as well? If people are globally updating and we measure their response time to verify one of these probes, then if there's no shift they ought to be fast, and if there has been a shift and they have updated their event representation, they ought to be slow, and it ought not to matter whether we're probing the dimension that changed or the other dimension. In other words, if I move the action to a new spatial location and I probe character information, that ought to be impaired as well. On the other hand, if people are updating in a purely incremental fashion, then if I change spatial location, I shouldn't update character information, and if I change characters, I shouldn't update spatial information; so those probes ought not to be slowed in those cases. The changed information is predicted always to be slower; what's of critical interest is the status of the unchanged information. We're going to measure how long it takes people to respond to these kinds of probes, and we're going to look at the effects of shifts on retrieval of the changed dimension and the unchanged dimension. We'll use regression models to control for a bunch of extraneous variables; most notably, it takes people longer to respond to longer probes.
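Before looking at the data, here is a deliberately crude sketch of the two hypotheses; it is bookkeeping for the logic of the predictions, not a process model, and the example values are hypothetical. The working model is just a dictionary of situation dimensions: a global update rebuilds every dimension at a shift, while an incremental update replaces only the shifted dimension. A probe for pre-shift information is "fast" if that information is still in the working model and "slow" otherwise.

```python
def global_update(model, shift):
    """Global reboot: the whole model is rebuilt at a shift, so pre-shift
    information drops out of every dimension (to be refilled from input)."""
    return {dim: shift.get(dim) for dim in model}

def incremental_update(model, shift):
    """Incremental updating: only the shifted dimension is replaced."""
    return {dim: shift.get(dim, model[dim]) for dim in model}

pre_shift = {"space": "by the front door", "character": "Jim"}
shift = {"space": "the garage"}                 # spatial shift; character unchanged

for name, update in [("global", global_update), ("incremental", incremental_update)]:
    post = update(pre_shift, shift)
    for dim, old_value in pre_shift.items():
        status = "fast" if post.get(dim) == old_value else "slow"
        print(f"{name:11s} probe of pre-shift {dim}: {status}")

# global      -> both probes slow
# incremental -> spatial probe slow, character probe fast
# The data described next show a mix: the unchanged dimension is somewhat
# slowed (a global component) and the changed dimension more so (incremental).
```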

Updating Event Models

121

which people track space while they're reading depends on their goals. But the key question here is: when they hit these event boundaries, did they incrementally update just the dimension that changed, or did they globally update, including the dimensions that remained the same? If people are globally updating and we measure their response time to verify one of these probes, then if there's no shift they ought to be fast, and if there has been a shift and they have updated their event representation, they ought to be slow. It ought not to matter whether we're probing the dimension that changed or the other dimension. In other words, if I move the action to a new spatial location and I probe character information, that ought to be impaired as well. On the other hand, if people are updating in a purely incremental fashion, then if I change spatial location, I shouldn't update character information; if I change characters, I shouldn't update spatial information. So those ought not to be slowed in those cases. The changed information is predicted always to be slower. What's of critical interest is the status of the unchanged information. We're going to measure how long it takes people to respond to these three kinds of probes, and we're going to look at the effects of shifts on retrieval of the changed dimension and the unchanged dimension. We'll use regression models to control for a bunch of extraneous variables; most notably, it takes people longer to respond to longer probes. Here's what you see. Let's start with the control condition. Here are the character probes. Compared to the no shift condition, people are slower when you shift space and probe character or when you shift character and probe character, and those are both about the same. If you look at spatial probes, if you shift a character and probe space, people are slower; if you shift space and probe space, they are slower yet. If this were the dominant result, it would be consistent with strictly global updating, whereas this kind of result is consistent with a mix of incremental and global updating. What you see across all the conditions is a pattern that is much more consistent with the latter: the unchanged dimension is slower compared to the no shift condition, which is an indication of global updating; but the changed dimension is slower yet, which is an indication of incremental updating. We also see evidence that what people are attending to modulates this to some degree. If we look at the effect on character information in response to a spatial shift and the effect on spatial information in response to a spatial shift (these are the two conditions in which space shifted), you can see that there are much larger effects of spatial shifts when people are attending to space compared to when they're attending to characters or to the control condition. What this indicates is that people are probably habitually attending
to characters, but if we ask them to attend to space they can. That again is consistent with the effects we saw in Lecture Four, where the effects of spatial shifts on reading time depended very much on what people were attending to. So, updating reflects a mix of incremental and global processes. Readers' default focus seems to be on characters. Attending to space leads to more tracking of space, and to more updating when spatial location shifts. The last study I want to tell you about returns to the Raymond stories and to the functional MRI paradigm that I described in the previous lecture. You'll recall that in this study people read story paragraphs, one word at a time, in the MRI scanner. They read two kinds of paragraphs. One kind of paragraph presented an intact little narrative; the other kind presented sentences that each made sense individually but were each lifted from a different narrative, so there was no coherent situation described by the sequence of sentences. Then afterwards, people were given yes-or-no recognition memory tests for the sentences and tests of comprehension. These are examples of those two conditions. I read these last time. I won't repeat them. We're interested here in what the temporal dynamics of the neural response in different brain regions can tell us about updating. What should the neural response look like over time if a system is doing global updating? What we'd expect to see is that there would be some transient response at the onset of the paragraph. Then, if the situation is coherent, we'd expect to see some sustained response. If you're doing global updating at the beginning of the paragraph, there ought to be a big transient response for the story condition. But in the scrambled condition, you ought to have to globally update at each new sentence, and so you ought to still have some response with each new sentence as the paragraph goes on. If you're doing incremental updating, then what you ought to see is a rise as new information is mapped into the situation model. That ought to be present really only in the story block. Here, what we're showing is a medial view of the left hemisphere, a medial view of the right hemisphere, and then posterior views of the left and right hemispheres. We see lots of areas of the brain that show big transient responses when the paragraph starts and often transient responses at the offset. But if those responses are identical for the story and scrambled conditions, that's not a very good candidate for being involved in global or incremental updating. It probably has something to do with shifting from staring at a fixation cross to reading and then shifting out of reading. But if you look at an area like the right posterior parietal cortex here, this is one of those higher-level association areas that shows global decreases through a lot of meaningful cognitive
tasks. It shows a big initial phasic response. Then, in the case of the story condition, it goes away; but in the case of the scrambled condition, it stays high throughout. If you look at rises throughout the story, we see several areas of the brain that show evidence of a linear trend that increases throughout reading the story. For example, back here in extrastriate visual cortex and the inferior temporal gyrus, we see increases throughout reading the story, but they're equivalent for the story and scrambled conditions, so they're not really good candidates for being involved in incremental updating, whereas this region here in the temporal pole shows little response in the scrambled condition, but shows a strong increasing response when the sentences can be integrated into a coherent story. These parietal regions show evidence of global updating. Remember, if you think back to Lecture Five, the neural correlates lecture, these areas show event model effects in lots of different paradigms. The temporal pole, a region that's strongly associated with semantic representations and a place where lesions or neurological disease create semantic dementia, shows evidence of incremental updating, suggesting that maybe access to the lexical semantics associated with each new piece of information is necessary to map new information into your event model, for both incremental and global updating. These are not primary sensory areas, but high-level association areas that show the effects. The big theoretical question that I want to ask is: How can we accommodate this evidence for multiple kinds of event model updating within accounts of event perception and cognition? Event Segmentation Theory, as we've discussed in the previous lectures, has a pretty simple updating mechanism. All it can do is register that there's an increase in prediction error and reboot an event model. All it can do is global updating. One possibility is that's just wrong and we've got to complicate the architecture. I'll do that if we have to, but I'm reluctant, because it's unparsimonious. The more mechanisms we have to build into the model, the more complicated the model is and the less explanatory power it has. Another possibility, which to me is more attractive theoretically (I don't know if it's more likely or not), is that we've got separate memory systems. The fact that we have a working model system doesn't mean that we don't also have simpler activation-based memory systems as well. One possibility is that the simpler systems are akin to Baddeley's (2000) phonological loop and visuospatial sketch pad, updating continuously and incrementally, but event models really do update strictly globally. We're currently trying to sort out which of these is a better fit to the facts on the ground.
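To make the logic of those predicted time courses concrete, here is a minimal sketch of how the idealized regressors might be constructed; the paragraph duration, sentence spacing, and sampling rate are illustrative assumptions rather than the values used in the study, and a real analysis would convolve these predictors with a hemodynamic response function.

```python
import numpy as np

# Idealized predictors for one paragraph, sampled at 1 Hz. A 40-second
# paragraph with sentence onsets every 5 seconds is assumed for illustration.
t = np.arange(40)
sentence_onsets = np.arange(0, 40, 5)

# Global updating: one transient at paragraph onset for intact stories, but
# a transient at every sentence onset for scrambled paragraphs, because each
# new sentence forces a fresh event model.
global_story = np.zeros(len(t))
global_story[0] = 1.0
global_scrambled = np.zeros(len(t))
global_scrambled[sentence_onsets] = 1.0

# Incremental updating: a ramp as new information is mapped into the current
# model, present only when the sentences form a coherent story.
incremental_story = t / t.max()
incremental_scrambled = np.zeros(len(t))
```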

To conclude, I want to say that event models have more capacity than classical working memory systems, but their capacity is still limited. Actors in real-life activity, viewers of screen activities, and readers update their models at event boundaries. Theories of event comprehension need to incorporate both incremental and global mechanisms, because the evidence clearly shows that both are present. And I'll stop there. Thank you!

Lecture 8

The Event Horizon Model and Long-Term Memory

I want to switch gears a little bit in this lecture. Thus far in this series we've been considering events that are experienced in the here and now or events that have just ended. I want to turn now to cognition about events where we've experienced a number of events in our recent past and have to retrieve information selectively from one. I am interested now in the relationship between events and long-term memory and, in particular, I want to raise the question of how it is that we access information from one episodic event representation in the face of all the others. What determines which events in our long-term memory are accessible and which ones are not? Gabriel Radvansky and I (2014) have proposed a model of access to events in long-term memory: the Event Horizon Model. The Event Horizon Model proposes five principles of episodic memory access. The first, those who have been here for the previous lectures will be very familiar with by now: it says that continuous ongoing activity is segmented into discrete event models. The second says that the event model corresponding to the current activity is actively maintained. In this morning's lecture, we talked about the distinction between representations that are actively maintained by recurrent neural firing and representations that are maintained by virtue of synaptic changes. The current event model depends on active neural firing, whereas representations of past events depend on LTM, which stands for long-term memory, by which I mean permanent synaptic weight changes. The third principle holds that long-term memory links events by their causal relations. We have been focusing thus far a lot on relations within events. When you turn to relations amongst events, one of the major organizing principles turns out to be causal connectivity. The fourth principle proposes that when elements of events are represented in multiple event models, access to those elements is facilitated. If there is a feature of my recent experience that occurs across multiple event
models, then access to that information is going to be facilitated. For example, in the last few days, I've seen Professor Thomas Fuyin Li many times, and so if I have to access information about Thomas, that should be easy to do. However, the fifth principle, competitive event retrieval, holds that when several event models have overlapping elements, accessing any specific event model is more difficult. So if I'm attempting to retrieve when we talked about a particular topic—did we talk about it at lunch yesterday or dinner the night before?—that's very challenging if there is overlap in the features, such as the other people who were present and the locations in which those events occurred. I want to focus in this lecture in particular on the first two principles, segmentation and working models, and ask: What are the effects of segmentation during comprehension on subsequent memory? In other words, what are the episodes in our episodic memory? My field of cognitive psychology has gotten by with this term episodic memory since it was coined by Endel Tulving in the 1970s, without thinking very carefully, to be honest, about what the episodes are. In the last few years, there has been exciting research on this topic and I want to introduce some of that today. I am going to start, however, with a little bit of older work which suggests that the structure of event segmentation during perception organizes the anchors in our episodic memory. I'd like to start by just reminding you of a result from this morning's lecture. Remember, we saw that if someone is watching an event unfold, and there's a change which is going to come along in about seven seconds, which will be a major event boundary, then after that event boundary, access to information presented immediately before it is difficult. But we saw this interesting finding that information that was presented in the previous event was accessible in immediate memory if it had been encoded during a previous event boundary. I just want to remind you of this interesting cell, where we've crossed over into a new event, but memory for a recently presented object is very good, in contrast to a case where we cross over to a new event but the object wasn't presented during a previous event boundary, and memory is very bad. What this suggests is that encoding during an event boundary, as illustrated by the white bars, can be protective against things that cause forgetting, such as the transition to a new event. Okay, those are very short-term delays: five-second delays in that study. But the positive effect of event boundaries on memory holds at considerably longer delays. This is an older experiment from Darren Newtson and Gretchen Engquist (1976), in which they showed people a brief movie of a person performing an everyday activity around the office and then after a delay they
tested memory. Each person saw either the first half of the movie or the second half of the movie. And then after a delay, they saw slides taken from the movie that came from both halves, so both the half that they saw and the half that they didn't see. Those slides were taken either from boundaries or from nonboundaries. What you find is that people are less able to correctly recognize slides that they saw if those slides came from nonboundaries, and they are more able to recognize slides that they saw that came from boundaries. This holds both for those who watched the first half of the movie and those who watched the second half of the movie. There's also a hint that the group that watched the second half is less able to correctly reject slides that came from the half of the movie that they didn't see if those slides come from nonboundary positions. These differences indicate that boundaries are privileged in episodic memory. A similar design was applied in a somewhat more controlled setting by Stephan Schwan and his colleagues (2000) more recently. They showed 40 participants a movie of a person upgrading a computer or cleaning a police pistol. Each movie was edited in two versions: either with cuts, where one shot of film ended and another began, at the event boundaries (the places that observers had told them were breaks between one meaningful unit of activity and another), or with cuts at middles. After viewing, participants recalled as much as possible from the movie that they saw, and the experimenters coded whether each action was recalled as a function of whether that action took place near a boundary or near a nonboundary, in a middle. They found that both for films that were edited with cuts at boundaries and for films that were edited with cuts at nonboundaries, information at the boundaries was remembered better. So it's not necessary that there be a visual cue like a film edit at the boundary. In both cases, you see that memory for boundary information is better than memory for nonboundary information. Very recently, Aya Ben-Yakov and colleagues (2013) have presented evidence that indicates that there is something special going on in neural systems associated with episodic memory formation at event boundaries. The design of their experiment was to show sequences of eight-second movie clips; each clip could be followed by another movie clip or by a scrambled pattern, and then by a fixation cross, and participants were asked to just look at the middle of the screen throughout the experiment. After participants watched these movie clips while their brain activity was recorded with functional MRI, there was a brief delay, and then they were given a cued-recall test in which they saw a still picture from the beginning of a movie clip, answered a question about that clip, and then rated their confidence in their response.
Each clip might be followed by another movie clip, a scrambled clip, or nothing. Then, what we can do is plot the MRI signal time-locked to the offset of the clip. They plotted activity in the right and left hippocampus. As we saw this morning, the hippocampus and associated structures are strongly associated with episodic memory formation and are highly activated when one has to reach back into the past and retrieve information from a previous event. What you can see here is that when a clip offsets and is followed by nothing, there is a large phasic response at that boundary at the end of the clip. And when there is a transition from one clip to another clip, there is a large phasic response at the end of each clip. For those who were here for the fifth lecture, I'll just remind you that the fMRI response is lagged by several seconds relative to the neural activity, so these two bumps correspond with neural activity at the end of each of the clips. So, when people are watching a boundary, either between a clip and nothing or between a clip and another clip, there's a spike, and the fact that you get this spike when the other clip begins tells us that it's not just noticing that the movie has ended. But the thing that tells us that this really has something to do with memory formation is that the magnitude of that bump at the end of the clips is a very strong predictor of how much they subsequently remember. On trials when there's a large evoked response at the event boundary, participants remember the events better; on trials when there's a smaller evoked response, people don't remember as much. And this is true for both the left and right hippocampus. What this suggests is that there's something going on in the nervous system at event boundaries that is important for encoding the information into subsequent memory, and variation in that response is related to subsequent memory. Okay, so if there are processes happening at event boundaries that are central to episodic memory formation, then we ought to see the footprints of that episodic memory formation when people go to retrieve. And the question that Ezzyat and Davachi (2011) asked in the following experiment was: "Does event structure guide retrieval of sentences in memory for narrative?" This experiment used narrative event materials to ask whether information within an event is bound together more strongly in memory than information across events. In general, when one reads a narrative, if one is then cued with a sentence from that narrative after a delay, this prompts retrieval of the sentence that followed the cue sentence. So, one good way to remind a person of something that was in a narrative is to present them with sentences that came nearby. The hypothesis that Ezzyat and Davachi (2011) proposed was that the cuing would be stronger if the following sentence was part of the same event, and weaker if the following sentence was the beginning of a new event.
This can be illustrated here. Participants studied narratives that included sequences like this: "He turned on some music to help him focus on his work", followed by either "A moment later, he discovered some useful information and made a few notes" or "A while later, he discovered some useful information and made a few notes". In both cases, the second sentence follows an initial sentence. In one case, it ought to still be part of the same event, and in the other case it ought to be the beginning of a new event. People see the first sentence for six seconds, they have a brief delay that varies from trial to trial, and then they see the following sentence. In one case, there's an event boundary; in the other case there is not. The hypothesis is that if someone saw this and then this, and we cue this sentence, that ought to strongly prompt retrieval of this one. If they saw this and then this, the retrieval of this ought to be weaker because it's part of a new event. The idea is that the chunk that is retrieved is bounded by these event boundaries, and if we create an event boundary by making it "a while later", that will suppress the chunking and retrieval. That's illustrated here. So, as one is reading along in a narrative, one might read a sentence, "He turned on some music to help him with his work", and then read "a while later", and you'd have a boundary before the next one, whereas later in the narrative you might have another, similar pair but with no event boundary in between because it was described as "a moment later". Then, we can cue memory with either one of these preboundary sentences or one of the precontrol sentences and ask how well people remember. What you find is that if you measure memory for the preboundary and precontrol conditions, memory for the boundary condition is reduced relative to memory for the control. In both cases, there is a cuing effect such that you have an improvement of memory following the cue, but it's reduced when there's an event boundary in between. I want to talk next about the organization of memory on the longest scale that I'm going to talk about in any of these ten lectures; that's on the time scale of autobiographical memory. This is an experiment from Williams, Conway, and Baddeley (2008), in which they brought college students into the lab and asked them to recall the last time they came to the university, or a vacation they had within the last few weeks. The participants then looked over the transcripts of their recalls and segmented them for the experimenters into meaningful events, and then they came back a week later and they did the same thing again—they recalled the initial trip that they had taken to the lab and they recalled that same vacation again. Coders then went back to their
transcripts and coded each of the clauses in terms of whether it described an action, a thought, a sensation, or a fact. An example is illustrated here. Someone might write: "walked to the university, along the edge of the park, it was raining a bit, but not enough to get wet, was busy talking, so walked past the Psychology building, and only noticed, when we were at the Student's union, turned back, and went into Psychology building, remember thinking that there weren't many students about, all the roads seemed quite quiet, walked up the hill, the worst part of the walk." They then tell the experimenters, "Here is where I experienced event boundaries in that," and then the experimenter goes back through and says, "Okay, that's an action, that's a fact, that's a fact, that's a thought," and so forth. We can ask what kind of information is presented at the beginning of an event, the middle of an event, or the end of an event. Here is what you find: if you look at the beginnings of events, the first clauses after a boundary, you find that on both occasions, the most frequent kind of information provided is information about actions. If you look at descriptions of the ends of events, you find that the most frequent information reported is facts. So, parts of an event, as we saw in Ezzyat and Davachi's (2011) study, cue other parts of that event. Events in memory have internal structure to them. Those chunks have structure to them: they begin with actions and they end with the results of those actions, which are usually facts. All of these things suggest that there's something happening at the boundary as we go from one event to the next that is especially related to the formation of long-term episodic memories, such that cuing works better within an event than across events, and there is a chunk structure such that different kinds of information are represented after event boundaries than before the next event boundary. I want to say a little bit about the mechanisms that support this relationship between segmentation and memory. Suppose that I give you a string of characters to memorize. For most of us, this string would be too long to hold in our memory. But if I gave this to my undergraduates, they would probably notice that it consists of a bunch of abbreviations that they use ubiquitously in texting: "Mind your own business, laugh out loud …" I won't pronounce all of them. Once they recognized that, the list would be well within their capacity. This was first shown by George Miller and his colleagues: if we can chunk a series of items, that can greatly raise our capacity, and you can take this to
extremes. In a famous study by Ericsson, Chase, and Faloon (1980), an undergraduate at Carnegie Mellon, Steve Faloon, committed a semester to learning to memorize digit strings. His job was to train himself so that someone could read him a string of digits and he could repeat it back—trying to make that string of digits as long as possible. By the end of the semester, he got up to strings 80 digits long, so you could read him out an 80-digit phone number and he would be able to repeat it back perfectly. Since the time of this study, memory competitions have become a thing, so there are competitors out there who can do three times this length, and they, like Steve Faloon, exploit chunking. It turns out he was a runner, and the strategy that he spontaneously developed was to code chunks of numbers as running times, like "these four numbers would be a good marathon time", "these four numbers would be a good ten-K [[kilometer]] time". And by chunking things like that, he was able to radically increase his capacity. The memory competitors pre-memorize a vocabulary of a large number of chunks. So if I memorize 100 chunks, then I can treat two-digit combinations as one thing. If I memorize 1,000 chunks, I can treat three-digit combinations as one thing. And these guys are now doing four- or five-digit chunks. They are learning to associate a five-digit number with a picture in their mind, and then all they have to remember is a sequence of pictures corresponding to those five-digit numbers. You can imagine that learning to memorize an association between five-digit numbers and pictures takes a lot of preparation, so these people train for hours a day for years to do it. The point is that chunking can be totally transformative for your long-term memory capacity. So, if events are the episodes in episodic memory, then forming good ones may be related to the quality of subsequent recall in the same way that forming good chunks of numbers or letters is related to good subsequent recall for simpler materials. And so, the hypothesis is that effective event units should facilitate encoding and search and thereby improve memory. We have tested this in a number of studies. The most comprehensive test came from a lifespan sample in which we tested 208 participants ages 20 to 79. Each participant came into the lab for a total of five hours over two days. We gave them multiple measures of their ability to segment events and their ability to remember them later. We gave them a test of their event knowledge, which I'll illustrate in a minute, and then we gave them a three-hour psychometric battery using standardized tests to characterize their cognitive ability in the way that we would typically do so in a neuropsychological clinic or in the lab. We gave them multiple measures of how quickly they could process information and of how much simple information, like digits and words, they could hold in mind at once. We tested episodic memory the way we usually measure it in the lab, which is to ask people to memorize arbitrary words or arbitrary pictures. Then we gave them measures of their knowledge—they had these
synonym and antonym tests where they'd give either similar-meaning words or opposite-meaning words, and trivia questions of the sort you might get at a trivia pub night. The participants all watched these three movies—one of a man preparing for a party, one of a woman making breakfast, one of a man planting houseplants—segmenting them to mark them into natural and meaningful units, and then their memory for the activities was tested after a delay. Now, I just want to remind you that when people are asked to segment a movie like this, there is good agreement about where the event boundaries are. Here I'm showing (actually this is a different movie from an older study, but these are equivalent data) what happens if we divide the movie up into one-second bins and count the number of people who segmented in each bin. In this study, as in the older study, participants segmented one time to identify coarse-grained event units and another time to identify fine-grained event units. In both cases, there are places where lots of people identify event boundaries and there are places where basically nobody identifies an event boundary. So, there's good agreement across observers about where the event boundaries are, and also, to recapitulate from that third lecture, we can quantify how well a given individual agrees with their compatriots. The way that we do that is by binning the data so that for each interval, say a one-second interval during the movie, we note whether the participant identified an event boundary at that point, and then we construct a norm from the segmentation probabilities of all the viewers. For each individual, we can ask: "How well does their segmentation agree with that of the group?" We can do that by correlating the time series for them as individuals with the summed series for the group. In this case, what we're going to do is rescale the correlation to control for differences between people who segmented more frequently and people who segmented less frequently. We'll rescale the correlation so that the best possible score you could achieve, given how many event boundaries you identified, is one, and the worst possible correlation you could achieve, given how many event boundaries you identified, is zero. Here is an example of someone recalling one of these events, and this person is doing pretty well. I'll just read you the beginning of it. They say: "a young lady came into the kitchen, washed her hands at the sink and dried them off, and went and got a skillet—I guess, er, it looked like a skillet to me, put it on the stove and then went to the refrigerator, got some things out of there, I think milk, some other things. She went to the cabinet and got a dish." This is a pretty detailed and quite accurate description of the activity.
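For those who want the computation spelled out, here is a minimal sketch of that agreement measure, assuming one-second bins coded 0 or 1 for each viewer and, for simplicity, a group norm that includes the individual; the exact rescaling used in the published analyses may differ in detail.

```python
import numpy as np

def segmentation_agreement(individual, norm):
    """Correlate one viewer's binned segmentation (0/1 per one-second bin)
    with the group norm (proportion of viewers segmenting in each bin), then
    rescale so that the best correlation achievable with that many boundaries
    is 1 and the worst achievable is 0."""
    individual = np.asarray(individual, dtype=float)
    norm = np.asarray(norm, dtype=float)
    k = int(individual.sum())               # boundaries this viewer marked
    assert k > 0, "viewer must mark at least one boundary"

    r = np.corrcoef(individual, norm)[0, 1]
    order = np.argsort(norm)
    best = np.zeros_like(norm)              # k boundaries in the k highest-norm bins
    best[order[-k:]] = 1.0
    worst = np.zeros_like(norm)             # k boundaries in the k lowest-norm bins
    worst[order[:k]] = 1.0
    r_best = np.corrcoef(best, norm)[0, 1]
    r_worst = np.corrcoef(worst, norm)[0, 1]
    return (r - r_worst) / (r_best - r_worst)

# Made-up example: 10 one-second bins, 5 viewers.
viewers = np.array([[0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
                    [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
                    [0, 1, 0, 0, 0, 1, 0, 1, 0, 0],
                    [0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
                    [0, 1, 0, 0, 1, 0, 0, 1, 0, 1]])
norm = viewers.mean(axis=0)
print(segmentation_agreement(viewers[0], norm))  # 1.0: this viewer's boundaries fall in the highest-consensus bins
```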

This is the event knowledge test. Rosen and colleagues (2003) had collected normative data on what American adults from a range of ages think are the typical steps for a number of common activities. If you ask a large number of people to list what typically happens when you shop for groceries, these are the things that they say. A very typical response would be something like: "you determine the items you need, you make a grocery list, you cut or gather coupons" (now, this was done a little while ago—I guess in the electronic age coupons are gone, so you clip coupons electronically), "you get in the car to go shopping, you drive or go to the store, you park the car at the store", and so forth. We asked one of our participants what usually happens when one goes shopping, and they listed a number of things; here is what they said: "you make a grocery list, then you count grocery money, decide on the grocery store, get your keys and purse, set the alarm, close and lock doors, take your key out, get in the car, drive to the store, park"—so they say a bunch of the things that are in the normative list and then they say some other things. These aren't wrong. They're just not typical, so they are not going to count in their favor. We are not going to penalize them for these other ones, but they don't count toward their accuracy. So higher scores mean listing more of the normative items. We score the proportion of normative steps reported, and then we can ask, what is the relationship between their segmentation agreement, or their event knowledge, and how much they remember later? Across all the participants, this is the relationship between segmentation agreement (how well they segment the activity) and recall performance (how many of the steps in the activity they correctly recalled). You can see it's a very strong relationship. It accounts for about a quarter of the variance. Similarly, we can ask what's the relationship between how much they know about how typical events unfold and how much they remember. And again, there is a very strong relationship, accounting for about a quarter of the variance. So, people who know more about how typical events go remember more. Now, when I saw these data, the first thing I thought was, "Maybe these people are just more clueful than those people, and they're going to do better at any cognitive activity that we test them on." The reason that we gave all these other measures was to ask which aspects of cognitive function are just general, such that people who are doing well cognitively will be on one end of this line and people who are doing poorly cognitively will be on the other end, and which aspects are really specific to event cognition. To ask this question, we used structural equation modeling to describe the statistical relationships amongst all of the large number of variables in this data set. The key question we wanted to ask is: Do segmentation agreement or event knowledge predict memory after controlling for everything else?
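To make the scoring rule concrete, here is a minimal sketch with made-up step lists; in the actual study, matching a participant's wording to the normative steps was done by trained raters, not by exact string matching.

```python
# Event knowledge score: the proportion of normative steps the participant
# mentioned. Extra, atypical steps are neither rewarded nor penalized.
# These step lists are invented for illustration.
normative_steps = {
    "make a grocery list",
    "gather coupons",
    "drive to the store",
    "park the car",
}

participant_steps = {
    "make a grocery list",
    "count grocery money",   # not in the norms: ignored, not penalized
    "get keys and purse",    # not in the norms: ignored, not penalized
    "drive to the store",
    "park the car",
}

score = len(normative_steps & participant_steps) / len(normative_steps)
print(score)  # 0.75: three of the four normative steps were reported
```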

I'll present the results of the structural equation modeling in stages. A structural equation model gives us an estimate of the strength of the relationship between cognitive constructs, controlling for all the other things that were measured. We can first ask, what is the relationship between age and education in this sample? It turns out that our older adults were a little bit more educated than our younger adults. The theoretically interesting questions start when we ask, what is the relationship between age or education and the standardized cognitive measures that we gave? We had three measures of each construct—working memory capacity, episodic memory, perceptual speed, and crystallized ability. I described those measures earlier. All of them are positively related to education, as you'd expect, and all of them except for crystallized ability are negatively related to age. What you typically find—and I say this with chagrin as someone approaching late middle age—is that as we age, our working memory capacity declines, our episodic memory capacity declines, and our perceptual speed declines. However, crystallized ability is generally stable or even increasing, and there are lots of other measures on which we older adults do fine. All of these things are related to each other, so that is all as it should be. So far things are looking sound and reasonable. Then we can ask, what is the relationship between these standardized measures of cognition and our event cognition measures? It turns out there is only one significant predictor of the event cognition measures: working memory capacity is a significant and strong predictor of how well you segment, but none of the other relationships is significant once you control for the other measures. But the real kicker is, what predicts event memory after we've controlled for all of these things [[the cognitive constructs]]? Are these things [[event segmentation and event knowledge]] independent predictors of event memory? And the answer is yes; both event knowledge and segmentation independently predict how much you're going to remember from a movie after we've controlled for all of these things and, strikingly, none of the standardized cognitive measures are significant predictors of event recall after we control for the event measures. So, in other words, if I take two people off the street who test just as well on our standardized measures—the kind of things that will keep you out of a neurological clinic and get you into college—but one of them segments activity better, that person will remember more of the events that I show them. I think that's kind of striking. Working memory predicts segmentation. Segmentation and event knowledge both independently predict event memory. So a natural question to ask is, can we intervene on this? Can we do something to take somebody who's down here and move them up this curve? Can we improve their memory by improving the encoding processes that drive their event cognition? Let's take a little break and then I'll take up that question next.
What I want to tell you about next is some efforts that we've been making in the lab to move people up that curve and try to intervene on their encoding of events to improve their subsequent memory. I will tell you about two strategies for doing so. One is to modify the presentation of events to highlight the boundaries, in the hope that this will provide better anchors for long-term memory. The other is to modify viewers' cognitive habits by instruction or by training. So far, we've only gotten as far as looking at instruction; we haven't gotten into training yet. Let's start with the first one. In this study, led by Dave Gold (2016) when he was a postdoctoral fellow in the lab, we took movies of everyday activities (you've seen this one of a woman making breakfast), and we edited them to highlight event boundaries. That's a point that our viewers identified as an event boundary. For some of the viewers, we would slow the movie down a little bit at that point and then ring a tone that we called the Gong of Doom, and then it would go on. Last but not least, we'd have an arrow pointing to the object that she's working with at that point. They get this set of cues saying, "This is an event boundary; you need to pay attention". Our hypothesis was that the slowing and the sound should induce a prediction error that would help them segment. In the first experiment, the movies were edited with a pause, the Gong of Doom, and a pointer, and then after the edited movies, people were given the opportunity to review a slide show of the points at which we had paused the movie. One group of participants got all of these cues at the event boundaries. A second group of participants we tried to lead astray by providing all that cue information at the middles—the points that we thought would be least adaptive for segmenting the activity. A third group got the movies without any editing at all. In this first experiment, the unedited condition didn't get any post-movie review, because we couldn't quite figure out which pictures to show them after the movie. The cuing-the-middles condition, I think, is really interesting, because on the one hand, by editing the movies to highlight the wrong information, we ought to be messing people up; but on the other hand, we had reason to think that simply chunking the activity might be helpful. That is to say, bad chunks might be better than no chunks. Not that anyone is ever not segmenting, but it might be that just getting people to attend to their segmentation, to stop and pause, is facilitatory even if we were cuing the wrong locations. In the second experiment, we got rid of the pointer and we added a post-movie review to the unedited condition. The way we did that was by showing them a sped-up version of the whole movie. We didn't want to pull out any particular points in the movie, because then we'd have to choose either boundaries or middles, so we showed them a high-speed version of the movie. In both
experiments, we had younger adults and healthy older adults. A lot of this work in the lab has focused on aging and Alzheimer's disease, because that's a population that has great concerns about their event memory. In tomorrow morning's lecture, I am going to turn in detail to aging and Alzheimer's disease. Okay, then we code the activity by breaking it down into lower-level actions. We can note each action in the movie and ask whether it occurred at an event boundary, at a midpoint, or somewhere else in the movie. Is it an action that's going to be cued in the good condition, cued in the bad condition, or one of all the other actions in the movie? Then, we scored their recall protocols to ask, how many of the boundary actions did they recall and how many of the midpoint actions did they recall? And here's what you see. There are three types of action: the ones that happened at an event boundary and are cued in the boundary condition, ones that happened at an event midpoint and are cued in the midpoint condition, and then the ones that are never cued. What you find is that compared to the unedited condition, cuing the event boundaries boosts recall for the information at those boundaries quite a bit, and it does so for both the younger adults and for the older adults. Cuing the event middles doesn't impair their memory, so the event middle group is doing no worse than the event boundary group. Cuing event boundaries facilitated recall for the event boundaries. It is a little surprising that the event middles are so good and that cuing them didn't hurt. We thought that maybe the pointer or the post-event review helped independent of the timing, so we took away the pointer and added the post-event review in the second experiment, and we thought this might reduce the difference between the control group and the middle group, or impair the middle group compared to the control group. Here are the data from the second experiment. What you see is that it is strikingly like the first experiment. Again, cuing the boundaries helps recall of that information from the boundaries, and again, cuing the event middles doesn't seem to mess people up. This suggests that cuing the event boundaries does help, as predicted, and, as a practical aside, interfaces that just encourage people to form chunks, even if they're not the greatest chunks, are helpful for memory encoding. The other approach that we have taken to ask whether we can functionally intervene on segmentation to improve memory involves simply trying to get people to attend to their segmentation mechanisms more. We think that people are segmenting ongoing activity into meaningful events all the time. But if we pay attention to our segmentation, that might help clean up the products of the ongoing mechanism and improve its ability to give us adaptive event units. And we wanted to know whether just forcing
people to attend to their segmentation would improve their memory and, if so, whether this is an intervention that would have just a fleeting effect or something that might have effects on time scales we actually care about. The intervention is really simple. I have told you over the last few days about a number of studies in which we asked people to watch a movie and segment it into meaningful events by pushing a button whenever one meaningful unit of activity ends and another begins. That's the whole intervention here. Some of the participants are going to be asked to do that task and others are not. The basic scheme for all of these experiments is that you watch several movies. Some of the participants watched them passively, with just instructions to try and remember as much as possible. Some of them segmented them into meaningful events while trying to remember as much as possible. We were a little bit concerned that the act of having to decide and push a button might have effects on their memory encoding, either positive or negative, and we wanted to control for that, so in some of the studies we asked people to try and push the button every 15 seconds. These people are pushing the button just about as much as the people who are doing the event segmentation task, but they are doing it just to mark time. Then we tested their memory at delays ranging from immediate (no delay) to four weeks. We tested their memory in two ways, using recall and then a recognition test. I'm going to show you data from five experiments including a total of about fourteen hundred subjects. One of them was conducted in the lab and the other four were conducted on Amazon Mechanical Turk. I'm curious: has anybody in the room used Mechanical Turk for data collection? This is a great resource. It is a service developed by Amazon where participants can sign up to perform brief cognitive tasks for micro-payments, and it was invented to give people a way, using distributed mass testing, to do tasks that artificial intelligence ought to be able to do but just can't do yet. The kinds of things that it's been used for are decoding characters on pictures of signs or searching for objects in satellite photographs, so these are things that you'd want a computer to be able to do for you but the computer can't do quite well enough yet. So you can create a little website where someone can look at a picture and tell you if there's a silo in it, and then thousands of people can come and look at the pictures and code your data. Amazon Mechanical Turk turns out to be a great platform for crowdsourcing psychological experiments, so we used it here. Here is an example of one person's recall protocol from watching that breakfast movie, and this is one of our really good participants on Mechanical Turk. You can see that it is a long, detailed recall protocol, comparable to the one that I showed you before, maybe even better, whereas here's an example of one that's not so good.
"A woman entered from the left. She then obtained ingredients. She moved around the kitchen, working on breakfast. Eventually, she exited the room." That's true, but it's not great. Here's a paradox: using Mechanical Turk, we can test hundreds and hundreds of participants and we can get recall protocols from all of them, so the recall data are cheap to acquire. Up to the point that we started these experiments, we had been hand-scoring each recall protocol by comparing it to a list of all the lower-level actions in the movies and scoring exactly which actions that participant had correctly recalled. That's what you'd want to do. It takes about twice as long to score the data as it takes to collect the data. These experiments only take a few days to run. Scoring that amount of recall data would take us years, and we have not figured out a way to put it on Mechanical Turk. The recall data are cheap to acquire, but they're really expensive to score. We spent a lot of time trying to think about how we could come up with an approximate measure that would tell us roughly how much they remembered, so that we could handle this massive amount of data. I'll tell you, we spent a lot of time using computational linguistic techniques and we had a lot of fun with it. There is a great parser from Stanford University that we used to break up the utterances into their syntactic components, and then we used statistical semantic models to compare those components to the elements in the rubric. We found that we could get a measure that correlated pretty well with what our human raters said that people remembered, so the automated measure could tell us with a correlation of about .7 or so what the human rater would have told us. But then the embarrassing thing was that the very first thing we tried was just counting the number of words in the recall protocol—and that worked just as well, so all this effort turned out to be for nothing. I just want to show you a little bit to validate that this measure—the simple word-counting measure—actually works. To test the measure, we reanalyzed data from that large-scale lifespan study that I told you about before the break: 200 participants, 3 movies each. Those had all been scored by hand. It took us about two years to do that whole experiment, and so we had multiple raters coding each recall protocol, we had high inter-rater reliability, and the raters were highly trained. We compared those hand ratings to a measure where we just counted the number of words in the protocol and divided by the number of actions that we'd identified in the movie. This [[dividing by the number of actions]] is just a control for the fact that some movies are longer than others, and some movies have more actions than others. This measure correlated at .77 with our hand-scoring.
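Here is a minimal sketch of that normalized word-count measure, assuming a plain-text recall protocol and a known count of scorable actions for the movie; the action count below is invented for illustration.

```python
def normalized_word_count(recall_text: str, n_actions_in_movie: int) -> float:
    """Approximate recall score: words produced per scorable action in the
    movie. Dividing by the action count controls for the fact that some
    movies are longer and contain more actions than others."""
    return len(recall_text.split()) / n_actions_in_movie

sparse_protocol = ("A woman entered from the left. She then obtained ingredients. "
                   "She moved around the kitchen, working on breakfast. "
                   "Eventually, she exited the room.")
# With an assumed 120 scorable actions, this sparse protocol scores about 0.2.
print(round(normalized_word_count(sparse_protocol, 120), 2))
```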

What I'm going to show you in these studies is that normalized word count measure—our measure of recall. To take those two examples, the normalized word count for the good protocol is 2.4, and the normalized word count for the bad protocol is 0.2. So it comes out like you'd expect. The second measure of memory that we're going to use is a recognition test. This is a screenshot from the Mechanical Turk website. People are told, "You are going to see pairs of pictures, and each pair contains one picture from one of the movies you saw and one picture from a movie that you didn't see. We want you to respond and choose the correct picture as quickly as possible." Then you'd see a pair of pictures like this. In this case, this one comes from the movie that you saw, and this one comes from an alternate take in which she used slightly different props and did things in a slightly different order. This recognition test is hard, and it has the character that lots of times people feel like they're just guessing. But in fact, as you'll see, people in general do quite well. Let me show you the recognition data first. The way I'm going to plot all of these data is with performance level on the X axis and condition on the Y axis, and we'll have separate plots as a function of delay and experiment. The error bars are always ninety-five percent confidence intervals. What you can see is that here, with the ten-minute delay, people are doing better after segmenting the movie than if they were just watching with the intent to remember or performing the timing task. Then, we can zoom out from that one condition of one experiment to all the conditions of all the experiments. I've arranged them here by delay, so delay is on the columns and experiment is on the rows, and sometimes we have the timing condition and sometimes we don't. What you can see is that at all delays, in all the experiments, the segmentation condition is numerically the best, with the exception of immediate testing. If we test immediately after presentation, the intentional encoding condition—just trying to remember as much as you can—works a little bit better than trying to remember as much as you can while you're pushing a button. What we think is going on is that immediate memory can be driven in part by surface features of the activity: visual details that are quickly forgotten. Encoding of those visual details is impaired by having to do a secondary task, namely, making decisions and pressing a button. If you're going to have to remember immediately after watching the movie, segmenting doesn't help you, but if you have to remember what happened a month later, segmentation helps significantly. If we look at the data for the recall test, we can plot them the same way. And if you zoom out again, you see the same pattern: immediately, people are a little bit better if they are just focusing on remembering the movie, but at all
the other delays they are better after they segment it. It's not statistically significant at the one-month delay, but it shows the same pattern as recognition. Another question that we were interested in is whether this is related to moving people up that curve; that is to say, amongst those who did the segmentation task, do we see the same relationship between segmentation and memory at these longer delays? In all the previous studies, memory had been tested either immediately or after about ten minutes, so we can look at the relationship between segmentation agreement and either recognition or recall memory in these data, just like we did in the study I told you about before the break. Now, of course, we can only do it for those subjects who were in the segmentation condition. This is only half or a third of the subjects, depending on the experiment. But when you do that, what you find is that in virtually all cases there's a significant correlation between segmentation agreement and recognition memory. This is present at the very shortest delay and it holds out to the very longest delay, and this relationship is even stronger in recall memory. So, at every delay in every experiment, there is a significant correlation between segmentation agreement and recall memory. Better segmentation is associated with better memory. Facilitating segmentation improves memory, and so far we have found two ways that seem to work: one is editing in cues to segmentation, and the other is giving people a task that asks them to attend to segmentation. The next thing that we are going to look at is whether having experience with that segmentation task then carries over to another viewing where we don't make you explicitly segment. If you've been thinking about segmentation and have practiced doing it, we want to know whether that transfers to a new viewing. If so, then maybe we can refine these interventions to be clinically or practically significant for memory, and we think that maybe this mechanism helps explain some of the features of the medial temporal lobe memory system and how it changes with age and neurological disorder. To zoom out and summarize the overall topic for today: event boundaries are anchors in long-term memory. Events are structured units of retrieval. Effective segmentation is associated with better memory, and we can improve segmentation with interventions. In the next lecture, we're going to see this play out really strikingly in healthy aging and Alzheimer's disease. I think the exciting frontier is to ask whether those interventions can really tell us mechanistically what's going on and at the same time help us practically improve people's memory. I'll stop there and take questions.

Lecture 9

Event Cognition in Aging and Early Alzheimer's Disease

It's an honor to be here and to be talking in this beautiful facility. Today I want to apply some of the topics and some of the concepts that we've been building up over the last four days to the first of two application domains. This morning's lecture is going to focus on cognitive impairments and cognitive changes with aging and neurological disorder. This afternoon's lecture at Peking University will focus on the application of these principles to media, to movies, and to stories. For those of you who are just joining us, I will try to be very careful to briefly recapitulate some background material as it becomes needed throughout the lecture. I want to start by motivating a little bit the problem of cognitive aging. This is a figure from the United Nations. They give these data nation by nation, and this is just showing projections for the number of elders in this country looking forward out to 2050. You can see that the good news is that nutrition and health care have come a tremendously long way over the last hundred years around the world, and so we're living longer. That's not bad news. But an important consequence of this is that the demographic balance of the world is tilting toward elders. Our economies need to reorient over the next decades to be economies that are not driven just by the relatively young, but that accommodate a range of ages that is shifted toward elders. Our societies need to adapt in the same way. Therefore, understanding how cognition changes across the adult life span becomes increasingly important. We know broadly that cognition changes not just as we develop from children to early adulthood but throughout adulthood, and this could potentially have strong effects on how we understand and remember events. What I want to ask today is whether there are mechanisms that are specific to cognition about events that show large changes with age. When we look at language
about events, perception, and memory, are there things that are specific to events that change with age, or does event cognition just change as a function of general changes in cognition? I also want to look at how all of this is altered by Alzheimer's disease (AD). Alzheimer's disease, as many of you will know, is a disorder associated with aging. It is not healthy aging, it doesn't affect everybody, but it has quite a high prevalence and it becomes more likely as we get older. The presence of a population that develops AD amongst us has to be taken into account when we think about the cognitive profile of aging in the world. I want to ask how all these cognitive changes are affected by AD. First, let me just start with a broad picture of cognitive aging. This is a cross-sectional characterization done by Denise Park and her colleagues (2002). What they did in this study was, in a very large sample in the United States, measure various aspects of cognition with multiple measures. They gave multiple tests of how fast people could process information; of the capacity of participants' short-term memory and working memory (the distinction being that short-term memory is conceived as memory for information where you just have to retain it with no interference and then repeat it back, whereas working memory requires you to hold onto information while you're manipulating other information); multiple measures of long-term memory; and then multiple measures of knowledge about facts about the world. In yesterday's lecture, we encountered the same construct, but it was labeled crystallized ability. Crystallized ability is your knowledge of facts about the world and vocabulary. I'll give you examples of some of the measures that they used. In particular, to look at long-term memory, this is the way that we often look at long-term memory in the laboratory. We give people a set of arbitrary visual patterns to memorize and then ask them to recognize whether a particular pattern is one of the ones that they studied. This is memory for information that is really divorced from semantic content, where you have to hold onto an arbitrary visual pattern. Verbal measures that we use in the laboratory often include memory for lists of words, where again, the task is really bleached of much semantic content and certainly of event content. One of the free recall tasks that they used here was to study a list of 16 words and then to try and remember as many as possible. In the cued recall measure, they studied 16 cue-target pairs. For example, you might see the word "bread" paired with the word BASKET and then have to try to recall those targets in response to the cues. So you would be given "bread", and you have to produce "BASKET" after the delay. If you compare across age groups, what you find is that there is a consistent decrease in performance as you look at older and older age groups. Now, I want
to emphasize that this is a cross-sectional design, so it is not that we can say that a given individual is performing worse as they age. But when we compare the performance of twenty year olds to the performance of eighty year olds, there is a big decrement. If we measure how quickly someone can process information, we see a similar pattern. In tasks in which you have to simply decide whether two visual patterns or letters are identical or different, the measure there is how many you can get through in a fixed amount of time or in a task where you have to use a key to map from a digit to a symbol. And performance is considerably worse in older adults than in younger adults. In short term memory tasks—the first two are visual short-term memory tasks, and the second two are verbal short-term memory tasks—again, performance is worse in elders than in younger people. Finally, in working memory—these are tasks in which you have to remember the orientation of a line or a word or a digit while manipulating other information, solving algebra problems or comprehending sentences—on those kinds of tasks again, elders do worse than younger adults. And as someone who’s sitting about here (40s–50s), this is a discouraging picture. But, I want to tell you that there is more to the story. When you look at verbal knowledge, this is a very typical finding. This is a test in which people have to recognize words from their definitions, recognize synonyms, and recognize antonyms. And what you see is that as long as we remain healthy, these measures increase monotonically at least up through the 70s. Verbal knowledge is well preserved across the lifespan. There is a picture where many aspects of cognition show age-related differences where older adults do worse than younger adults. But there are important exceptions, and I am going to emphasize some other exceptions as we move on. I also need to emphasize that culture plays an important role in these effects that is not yet fully understood. I am especially cognizant of this point given that I’m here in Beijing, because some of the comparisons that uncovered the role of culture were comparisons between Chinese and American samples. Here is one study that was conducted in the early nineties to test a hypothesis about stereotype threat. Stereotype threat is a concept that was put forward by the social psychologist Claude Steele. What Steele noted is that being the subject of a negative stereotype can impose a cognitive burden on one. So if one feels that one’s group is less intelligent or less attractive or less successful in domains that are relevant to that stereotype, then this can weigh on your mind. That weighing on one’s mind sounds metaphorical and fuzzy, but they mean this in a very precise sense—that it can impose an attentional demand

on you as you perform tasks relevant to the stereotype that impairs your performance, leading to a self-fulfilling prophecy. So the fact that cultural beliefs about aging and cognition vary from place to place means that the effects of this kind of stereotype vulnerability also might vary from place to place. What Levy and Langer (1994) did in this study was [[to]] test older and younger participants in the US and China. They measured memory using the kinds of tasks that I just showed you and also measured attitudes towards aging and cognition with a carefully normed and translated questionnaire. One of the important issues in these cross-cultural studies is making sure that the translations are apt. If I am using a questionnaire instrument and it’s translated, I need to make sure that I’m really giving effectively the same questionnaire to both groups. Questions of translation are really important here, especially given the different affordances of different languages. The questionnaire revealed that US participants had a less positive view of aging. In the younger adults, those in the US rated attitude toward aging lower than those in China. This view is held by older adults as well. So, this is a difference in attitude that is held by the people that are subject to those attitudes, not just by their younger colleagues. When they looked at performance on the memory tests, what they found is that in the United States there was a huge difference in performance between the younger and older adults, whereas in China there was a much smaller difference. The moral of the story: Grow old in China. Aging is associated with worse performance on memory tests and other cognitive measures. These effects are smaller in domains when people can form and use event models. And it’s moderated by cultural beliefs. This second point, actually, I really need to give you more data before you should buy. So that’s what I turn to next. I want to talk about how organizing information in event models can reduce age differences. This is a reminder of a paradigm that I introduced in the second lecture. I’ve expanded the description a little bit to try to make the design as clear as possible. This is a paradigm originally developed by John Anderson (1974) and called the “fan effect” paradigm. In a “fan effect” paradigm, what you’re doing is memorizing sentences that essentially associate an object with the location. You are reading lots of object location pairs, then afterwards you are given a memory test for those object location pairs. If you read something like the “information sign is in the park”, across this set of sentences, the information sign only appears once and the park appears once. That ought to be relatively easy to recognize later, if both of the items in the pair are unique. On the other hand, if you read “The welcome mat is in the laundromat; the pay phone is in the laundromat; the oak counter is in the laundromat”, now you have the laundromat associated with three different objects. What Anderson has shown is that

associating a noun phrase with multiple other noun phrases increases the access time during retrieval. In his model, this happens because when I activate "welcome mat" and "laundromat" at retrieval time I also activate the pairing of "pay phone" and "laundromat", "oak counter" and "laundromat". Then I have to sort out that interference, so it takes me longer to access. Now the difference between this group [[the first three]] and this group [[the last three]] is that in this case [[the first three]] you've got three objects in one spatiotemporal framework [[in]] one setting and here [[the last three]] we've got one object in three settings. What Gabriel Radvansky and his colleagues (1996) proposed is that the fan effect ought to apply here [[the last three]]. But it could be that people would be able to use the common spatiotemporal framework to reduce interference in this case [[the 2nd to 4th sentence]]. If the representation that I construct as I study these items is an event model that represents the situation of being in a laundromat, and each time I get one of these new sentences I populate it with another object, then I shouldn't see this interference—all I have to do is access that one representation. Now I want to emphasize that I've sorted them here, so that the common objects and the common noun phrases go together. In an actual experiment, these are randomly presented, so you see these occurring in a string of pairs. In this kind of paradigm, the main independent variable is "fan". That's the number of nouns paired with a given noun phrase. In the multiple location condition, you would predict that there would be interference because each association ought to give rise to a different event model. Whereas in the single location condition, to the extent that people can construct a common event model with the objects in that one event model, then that should reduce the fan effect. What I showed last time was that people are slower to correctly recognize items that have been paired with multiple other items, if they are in multiple locations as indicated here, whereas there's no effect of fan if they're in a common location. In the multiple location condition, there's an increase in response time, but in the single location condition there is not. This is true for older adults (in the dark lines) just as much as for the younger adults (in the light symbols). And a similar pattern holds for rejecting items that you didn't see. We see a similar pattern in accuracy, and note that the older adults are quite likely to falsely recognize items that they never saw, if that object had appeared in multiple locations—much more so than younger adults. So there's an age-related memory problem here, but that age-related memory problem is pretty much completely ameliorated if the objects can be integrated into a common spatial location, into a common event model.
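As a concrete illustration of the design, here is a small Python sketch that lays out the two conditions and computes the "fan" of each noun phrase. The laundromat items echo the examples above; the multiple-location items (a potted palm appearing in three settings) are invented placeholders for illustration, not the actual stimuli.

from collections import Counter
import random

# Study pairs of the form (object, location). The first set puts three objects
# in one setting (the single-location condition); the second puts one object
# in three settings (the multiple-location condition).
single_location = [
    ("welcome mat", "laundromat"),
    ("pay phone", "laundromat"),
    ("oak counter", "laundromat"),
]
multiple_location = [
    ("potted palm", "hotel lobby"),
    ("potted palm", "airport"),
    ("potted palm", "city hall"),
]
study_list = single_location + multiple_location + [("information sign", "park")]
random.shuffle(study_list)   # pairs are presented in a random order at study

# The "fan" of a noun phrase is the number of different items it is paired with.
object_fan = Counter(obj for obj, loc in study_list)
location_fan = Counter(loc for obj, loc in study_list)
print(object_fan["potted palm"], location_fan["laundromat"], object_fan["information sign"])
# -> 3 3 1

The prediction sketched above is that a fan of three slows retrieval in the multiple-location condition, but not when the three associations can be folded into one laundromat event model.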

Here's another relevant finding, one that's close to my heart. In this experiment, Arthur Shimamura and his colleagues (1995) tested community-dwelling younger adults and older adults, and both younger and older college professors. They gave them a set of laboratory memory tasks of the sort that I described before and a test of prose recall. The prose recall measure that they used is the logical memory subtest of the Wechsler Memory Scale. They also gave them prose recall tests for a couple of technical passages, one of them having to do with the earth's atmosphere and one having to do with anthropology. The older professors did less well on all of those semantically impoverished tests of memory that I described before, the ones where we have to memorize arbitrary visual patterns or arbitrary lists or pairs of words. But, if you look at the prose recall—remembering a short story—the college professors all do great. They do as well as the younger adults and better than the community-dwelling older adults. Old college professors do considerably better than other elders. If you look at the technical passages, they do strictly better than even the younger adults. We're all headed into the right line of work. So, memory for arbitrary associations is affected by aging, as are many other areas of cognition. Memory for coherent event models is much less affected, particularly when the person can take advantage of the event structure. Now I want to talk about memory in healthy aging as compared to AD. The first thing to note is that memory complaints are a central feature of AD. They are usually the first symptom that brings an older adult into the clinic to get help. We used to think that memory disorders were simply accelerated aging, that AD was just basically like getting old faster. But it now looks like that is too simple. The picture is more complicated. I want to describe a study that illustrates how it's a little bit more complicated. This used the same logical memory passage that I described in the Shimamura (1995) study. This is the passage; this is one of the most widely used little bits of text in neuropsychology. The simple paragraph says Anna/ Thompson/ of South/ Boston/ employed/ as a cook/ in a school/ cafeteria/ reported/ at the police/ station/ that she had been held up/ on State Street/ the night before/ and robbed of/ fifty-six dollars./ She had four/ small children/ the rent was due/ and they hadn't eaten/ for two days./ The police/ touched by the woman's story/ took up a collection/ for her. In the clinic or in the lab, this is read to the participant, and they're told to memorize it as well as they can. The slash marks here indicate the units of scoring of this test. For each of these little bits, you can score whether it was recalled verbatim or the gist was recalled. You can count according to a stricter or more lenient scoring criterion how much of the passage was recalled.
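As a rough illustration of how such a protocol can be scored, here is a small Python sketch that splits the passage into its scoring units and tallies how many of them a retelling captures. The simple word-overlap rule is only a stand-in for the verbatim, gist, and distortion judgments that trained scorers actually make.

PASSAGE = ("Anna/ Thompson/ of South/ Boston/ employed/ as a cook/ in a school/ "
           "cafeteria/ reported/ at the police/ station/ that she had been held up/ "
           "on State Street/ the night before/ and robbed of/ fifty-six dollars./ "
           "She had four/ small children/ the rent was due/ and they hadn't eaten/ "
           "for two days./ The police/ touched by the woman's story/ "
           "took up a collection/ for her.")

units = [u.strip(" ./") for u in PASSAGE.split("/")]

def crude_score(retelling, units):
    # Credit a unit if most of its content words appear in the retelling.
    # Real scoring distinguishes verbatim recall, gist, and distortions.
    told = set(retelling.lower().replace(",", "").replace(".", "").split())
    recalled = 0
    for u in units:
        words = [w for w in u.lower().split() if len(w) > 2]
        if words and sum(w in told for w in words) >= max(1, len(words) // 2):
            recalled += 1
    return recalled, len(units)

print(crude_score("A woman from Boston who worked as a cook was robbed", units))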

In this study, as I said, the story is read out to the people, they're asked to repeat it back immediately and then again after doing a bunch of other tasks for about half an hour. These were scored for each of those units. They were given credit if they got the exact words right; they were given credit for getting the gist right, if they changed the surface form but preserved the meaning; and they were given partial credit, scored as a distortion, if they had some of the predicate right but some of the arguments misremembered. We have four groups of participants: younger adults, healthy older adults, those with very mild AD and those with slightly advanced AD. What you see is that the healthy older adults remember the gist well; maybe they are slightly impaired on the delayed test, but they're very comparable to the younger adults. Whereas the folks with quite early stage AD—these are people who you would not immediately recognize as being cognitively impaired if you met them—they are considerably impaired, even at the earliest delays. Whereas healthy aging preserves memory for narrative prose, for a sequence of events, pretty well, even very mild AD leads to an easily detectable impairment in that kind of memory. So this suggests that there's something interesting to look at in the event cognition of people with AD, as compared to the event cognition of healthy older adults. To address this, I want to return to the paradigm that I discussed in yesterday morning's lecture, in which we used very short-term memory for information in a prose story to look at two kinds of memory updating. To remind you, I'll read this passage again. This is the passage in which what we're going to do is [[to]] change the spatial location of the action or introduce a new character and probe memory from time to time. The story goes Jim picked up his keys from the basket by the front door and paused. The basket was supposed to be a place just for keys, but his were always buried under everything else in there. Jim hated how it became a place to keep junk. From now on he would keep it clean, he vowed. He found his keys and walked into the garage. So, with "walked into the garage", we change the spatial location and then we probe "by the front door", so we're probing spatial information after a change in spatial location. The story continues on: On the top shelf in the corner, Jim saw the box that his wife had conveniently labeled "Camping Gear". As he pulled it down, the sleeping bags that had been piled on top fell down around him. "At least I won't forget
those," he muttered as the last one bounced off his shoulder. Opening the tote, he found matches, fire starter, flashlights, camping dishes, and some random pieces of rope. Walking into the garage, Kathy laughed at the pile of stuff surrounding her husband. Here, we're introducing a new character and keeping the spatial location the same. But then we're probing spatial location again. So we can probe information about spatial location after a spatial shift or after a character shift. Conversely, we could probe information about characters after a spatial shift or a character shift. We want to distinguish two kinds of updating. One kind of updating involves taking your current event representation, blowing it away, and forming a new one. We call that global updating. The other kind of updating involves continuously incorporating new information that is introduced in the text into your representation. We call that incremental updating. Whether you're updating globally or incrementally, in this case ("He found his keys and walked into the garage") you ought to update spatial information when you change the spatial location. If you map the new spatial information into your event model, just that piece of information, then that might interfere with retrieving this piece of spatial information. If what you're doing is creating a whole new event model, then it's less likely that information would still be in your new event model. Either way, we would predict that you would be slower when you are being probed with "by the front door". Contrast that with this case where we have switched characters but we're going to probe spatial information. Here, if you are incrementally updating, if you switch the characters the spatial information ought to still be preserved, because it hasn't changed. On the other hand, if you're updating globally, if what you're doing is rebuilding your event model at those points, then with some probability you may have erased that information about the spatial location when you updated. Conceptually, this predicts that if one is doing global updating, if we test memory for the dimension that didn't change, it should be as slow as memory for the information that did change. Whereas if you're updating purely incrementally, then information for the unchanged dimension ought to be as fast as if there had been no shift at all, and only the information that changed ought to be slowed. We use regression models to estimate the effects of shifts on the unchanged dimension and the changed dimension, controlling for extraneous variables such as sentence length. In yesterday's lecture, I showed data from younger adults. Here I'm interested in the comparison between younger adults and healthy older adults.
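To give a concrete sense of what such an analysis might look like, here is a minimal sketch using Python and the statsmodels library. The data file, the column names, and the choice of a simple mixed model with by-subject random intercepts are illustrative assumptions on my part, not the exact specification used in the published analyses.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data, one row per memory probe:
#   rt           response time to the probe (ms)
#   shift        what changed just before the probe: "none", "spatial", "character"
#   probe_dim    which dimension the probe tests: "spatial" or "character"
#   sent_length  length of the preceding sentence (nuisance covariate)
#   subject, age_group identify the participant
trials = pd.read_csv("probe_trials.csv")

model = smf.mixedlm(
    "rt ~ shift * probe_dim + sent_length + age_group",
    data=trials,
    groups=trials["subject"],   # random intercept for each participant
)
result = model.fit()
print(result.summary())

# Purely global updating predicts slowing after a shift even when the probed
# dimension did not change; purely incremental updating predicts slowing only
# when the probed dimension is the one that changed. That difference shows up
# in the shift-by-probe_dim interaction term.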

We have seen that healthy older adults have excellent comprehension and long-term memory for narrative prose. One possibility, then, is that the older adults will update similarly to the younger adults. Another possibility is that older adults might be more global or more incremental in their updating. One reason you might think that older adults are more global is that data from a number of experiments suggest that older adults are more tuned to the situation model or event model level of the discourse than to the surface structure. And that it might be more facilitatory to update a whole event model than to focus on individual components. However, there are data suggesting that, if anything, older adults might be more incremental in these situations, because older adults show more incremental updating in response to syntactic shifts in narrative. They show more slowing down at the micro components of the syntactic structure. And there is work from Campbell and Hasher (2010) suggesting that older adults have difficulty inhibiting the incorporation of new ongoing material into their running working representation even when they have been asked to avoid doing that. It is equally reasonable that older adults might be more global or more incremental. What we saw yesterday is that younger adults show evidence for both global and incremental updating. This is a different dataset than the dataset yesterday. But what we see is that for both probes of character information and spatial information, people are a little bit slower when they are probed on the dimension that did not change, which suggests that they’re doing some global updating. But they’re further slowed when they are probed on the dimension that changed, which suggests that, above and beyond doing global updating, they are updating selectively the dimension that changed. To our surprise, in this dataset, the older adults looked as if they were doing completely global updating. So as you can see in both cases, information on the unchanged dimension is slowed at least as much as information on the dimension that changed. This suggests that the algorithm that’s being used to update the event model during prose comprehension by this sample of older adults is to update the event model as a whole and not to map new information into the event model as it’s going on. The next topic I’d like to address is the relationship between the segmentation of activity into ongoing events and long term memory. We’re about halfway through, so this is a good place to take a short break. What I’d like to do now is to talk about the relationship between how people segment ongoing experiences into meaningful events and how that relates to memory encoding in healthy aging and AD. For those who have been with us the whole week, this will be a little bit of a review, but I just want to briefly remind all of us of the measures that we typically use in my laboratory to study

event segmentation. In these studies, what we're going to do is [[to]] show people a movie of an everyday activity and ask them to push a button to mark when one meaningful unit of activity ends and another begins. There are two measures of how they are segmenting that activity that I am going to be interested in here. There are lots of things you can look at in these kinds of data, but these two prove to be interesting with respect to aging and AD. The first is a measure that we call segmentation agreement. In general, if you ask a group of observers to segment an everyday activity, you find strong intersubjective agreement about where the boundaries are. We can quantify that by binning the data into some standard interval—we usually use one-second bins. For each individual, we can record whether they segmented during that one-second bin. Then we can add up all of those time series from all the individuals to get a norm for the group. So then if we want to know how well a given individual agrees with the rest of the group, we can correlate their segmentation with that of the group. Sometimes it's helpful to rescale the data. The correlations depend on how many event boundaries a given individual identifies, and that can be a source of extraneous variance. Sometimes we will rescale the data so that the worst correlation that you could get given the number of event boundaries they identified is rescaled to zero and the best correlation you could get is rescaled to one. That is the measure we've been talking about quite recently. This other measure we haven't talked about for a few lectures, so it is probably especially important to remind us of it. Segmentation also tends to be hierarchically organized such that people spontaneously group smaller units of activity into larger superstructures. We can quantify that by comparing the locations of the boundaries when someone is asked to segment a movie twice, once to identify fine-grained units, which we usually describe as the smallest units that are natural and meaningful to them, and once to identify coarse-grained units. If someone is segmenting an activity at a coarse grain and they identify, say, a boundary here and here, and then they segment it later at a fine grain and identify all of these boundaries, if when they were doing this fine-grained segmentation, they were spontaneously grouping the fine-grained units into larger units, then a subset of the fine-grained units ought to line up with the coarse-grained units. So we can measure the extent to which they line up. The way that we measure the extent to which they line up is: For each of these coarse-grained event boundaries, we find the closest fine-grained boundary and measure its distance, and then we can estimate how close on average the closest fine-grained boundary would be to the coarse-grained units if the relationship between them was random. The extent to which these fine-grained boundaries are closer to the coarse-grained boundaries than would be expected if the relationship were random is our measure of hierarchical organization.
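To make these two measures concrete, here is a minimal sketch in Python of how one might compute segmentation agreement and hierarchical alignment from lists of button-press times. The function names and the exact chance-correction details (the rescaling to a zero-to-one range and the randomization baseline for boundary distances) are simplifications of my own, not the precise procedure used in the lab.

import numpy as np

def to_bins(press_times, duration, bin_width=1.0):
    # Turn a list of button-press times (in seconds) into a 0/1 series,
    # one entry per bin: 1 means at least one boundary fell in that bin.
    n_bins = int(np.ceil(duration / bin_width))
    series = np.zeros(n_bins)
    for t in press_times:
        series[min(int(t // bin_width), n_bins - 1)] = 1
    return series

def segmentation_agreement(individual, group, duration, bin_width=1.0):
    # Correlate one viewer's binned segmentation with the summed series of
    # the other viewers (the group norm), then rescale so that the worst
    # correlation achievable with that many boundaries maps to 0 and the
    # best achievable maps to 1 (a simplified version of the scaling above).
    subj = to_bins(individual, duration, bin_width)
    norm = np.sum([to_bins(g, duration, bin_width) for g in group], axis=0)
    r = np.corrcoef(subj, norm)[0, 1]
    k = int(subj.sum())
    order = np.argsort(norm)
    worst = np.zeros_like(subj)
    worst[order[:k]] = 1    # same number of boundaries in the least-agreed-on bins
    best = np.zeros_like(subj)
    best[order[-k:]] = 1    # same number of boundaries in the most-agreed-on bins
    r_worst = np.corrcoef(worst, norm)[0, 1]
    r_best = np.corrcoef(best, norm)[0, 1]
    return (r - r_worst) / (r_best - r_worst)

def hierarchical_alignment(coarse, fine, duration, n_iter=1000, seed=0):
    # For each coarse boundary, find the distance to the nearest fine
    # boundary; compare the observed mean distance with what would be
    # expected if the same number of fine boundaries were placed at random.
    rng = np.random.default_rng(seed)
    coarse, fine = np.asarray(coarse), np.asarray(fine)
    observed = np.mean([np.min(np.abs(fine - c)) for c in coarse])
    chance_dists = []
    for _ in range(n_iter):
        rand_fine = rng.uniform(0, duration, size=len(fine))
        chance_dists.append(np.mean([np.min(np.abs(rand_fine - c)) for c in coarse]))
    # Positive values mean the fine boundaries sit closer to the coarse
    # boundaries than chance, i.e., more hierarchically organized segmentation.
    return np.mean(chance_dists) - observed

In practice the group norm would be computed from the other participants only, and degenerate cases (a viewer with no button presses, or with a press in every bin) would need special handling; the sketch only shows the logic described above.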

So, we have two measures of the perception of event structure: segmentation agreement and hierarchical alignment, and we have multiple measures of event memory. Just a reminder that one measure that we're using a lot here is event recall, where we just show people the movie after some delay and ask them to tell us what happened. If somebody saw a movie of a person making breakfast, they might then say "A young lady came into the kitchen, washed her hands at the sink and dried them off, and went and got a skillet—I guess, er, it looked like a skillet to me"…. This is an example of a good recall protocol. In the last study I presented, yesterday afternoon, we simply counted the number of words that people gave us in all of these studies. Here, we are going to be hand-scoring the recall protocols to count exactly how many fine-level actions the person correctly recalled. We can test their recognition memory for the visual information in the stimulus by showing them a picture that actually appeared in the movie that they saw paired with a picture taken from a similar other movie and asking them to choose the correct picture. In one study conducted by Christopher Kurby (2011), when he was a postdoctoral fellow in the lab, participants watched five movies of everyday activities—doing laundry, setting up a tent, planting some plants, washing a car, and building a model out of a construction toy—and they segmented those to identify coarse and fine event boundaries. Then they recalled the movies and then they performed the recognition test. What you find in two experiments is that younger adults agree with each other better than older adults agree with each other. And I want to note that it is not the case that older adults are systematically picking out a different set of points in time. If you compare a given younger adult to a younger adult norm or to an older adult norm, or you compare an older adult to an older adult norm or a younger adult norm, you still get the same result. What is going on here is that both groups are basically identifying the same boundaries, but the older adults are just doing so less consistently than the younger adults. The older adults also have reliably less hierarchically organized segmentation. In both experiments, the degree to which the event boundaries are lining up between coarse and fine is higher in the younger adults than in the older adults. Yesterday, we saw that in a number of relatively large samples there was a strong relationship between how well one segments an activity and how much one remembers later.

What we find is, replicating the data that I showed you before, in these older adult samples, there is a strong relationship between segmentation agreement and memory. So in the older adults, memory performance is lower overall, segmentation agreement is lower overall, but within that group there is a strong relationship between segmentation agreement and memory. There is also a strong relationship between hierarchical alignment and memory. This is a new result that I haven’t showed you before, but we see this quite typically. In older adults, like in younger adults, the quality of segmentation is related to the quality of subsequent memory. How well one encodes a series of events during viewing predicts how much one remembers later. This suggests an avenue for intervention—that maybe one way that we can address age-related memory complaints is by trying to move people up these curves by intervening on segmentation. One thing that we’ve been very interested in looking at in the difference between healthy aging and AD is the neural correlates of these behavioral effects on segmentation and memory. One of the areas of the brain that is most early affected in AD is the medial temporal lobes. This is the hippocampus and surrounding structures including the entorhinal cortex and the parahippocampal cortex. In collaboration with our Alzheimer’s Disease Research Center (ADRC), we have been able to look at the structural integrity of these structures in relation to these memory phenomena. In the study conducted by Heather Bailey (2013), older adults aged 63–90 were recruited from our AD Research Center. These participants are really truly heroes. They come in once a year and complete an exhausting clinical exam and cognitive testing; they undergo functional and structural MRI scans longitudinally and repeatedly. Many of them also have longitudinal lumbar punctures to measure the presence of amyloid in their cerebrospinal fluid, which is associated with AD neuropathology. These people do a tremendous amount for us. In this study, a group of them came into the lab to do further cognitive testing. These people were recruited to have a diagnosis that they were healthy—they did not have AD, or they had very mild AD, or slightly more advanced AD. In addition to the data that we had from the ADRC, in the lab we had them perform the event segmentation and memory tasks. We used the cognitive testing data from the ADRC and—critically for the data I’m going to show you—the neuroimaging from the ADRC. We had measures of the volumes of the hippocampus, parahippocampal gyrus, and entorhinal cortex of all of these people from the ADRC. This is just illustrating these structures. Those of you who have been here long have seen pictures of the hippocampus before. You may recall that this structure was selectively activated when people had to reach back across an

event boundary and grab something from their episodic memory. The entorhinal cortex and the parahippocampal cortex are immediately adjacent to the hippocampus in the medial temporal lobe. Behaviorally, what we find is that, as in the previous data sets, there is a strong relationship between segmentation performance and event memory. Here we are using a composite of the recall and recognition tests. What we see is that there's a strong correlation between how well one segments the activity, as measured by segmentation agreement, and how much one remembers later. Now this is grouping together all three of the groups of older adults. But if we break them out, what we see, first of all, as you'd expect, is that there is a progression such that the healthy older adults are segmenting better and remembering more than those with very mild dementia, who are in turn segmenting and remembering better than those with more advanced dementia. So these people are having quite poor memory performance. If you look at their recall protocols, they are watching a five-minute movie and only giving a couple of sentences if they are down here. But even in this most demented group, there is a significant relationship between segmentation and memory. Folks who are segmenting well are not remembering that much worse than the mean. So above and beyond a clinical diagnosis of AD, how well you segment an activity predicts how much you are going to remember. The main question we wanted to ask about the structural neuroanatomy was: Does atrophy in these regions, as measured by the volume that we assayed in the MRI, account for this correlation between segmentation agreement and memory? I have depicted the results here in a Venn diagram where, if we look at the correlation between event memory and segmentation agreement, that's accounting for about 14% of the variance after controlling for other things, and about half of that variance is accounted for by the volume of the medial temporal lobe. I want to emphasize that this is a cross-sectional study and we didn't have longitudinal MRI measurements of very many of the people. Our power is limited by the fact that some people are just born with bigger medial temporal lobes than others. Even in the face of that individual variability, you can see this strong relationship where people who have less volume in their medial temporal lobes segment less well and they remember less well. And those folks also are the ones who wind up with an AD diagnosis. Almost everything that I've said about relations between event perception and other aspects of event cognition so far has been about memory or language. But way back at the very beginning of these lectures, I made this claim that the representations that we use to understand events in perception and
reading are the same event representations that we use to guide our actions. And in this study, we wanted to make an initial assay of this in the context of AD. To do this, we collaborated with Tania Giovannetti (Bailey et al., 2013) at Temple [[University]], using a test that gives a fine-grained measure of people's ability to act in everyday events. So Giovannetti, Myrna Schwartz and their colleagues have developed a test they called the Naturalistic Action Test, in which people perform everyday activities that have been very tightly instrumented and normed. We can characterize in a lot of detail the number of steps that are accurately performed and the kinds of errors that people make. In this version of the test, what people have to do is to pack up a child's backpack and lunch. They are given a whole bunch of props that are needed to do that, including a sausage that we have to keep in the refrigerator to bring out just for these studies, and bread and mustard and juice and a bunch of objects that were not relevant for performing the task. We can ask to what extent people grab objects that are not appropriate. Then, we can ask how many of the steps the person performs, what kinds of errors they make, whether they leave out steps, and whether they perform steps incorrectly. I am going to show you a composite measure of how well people performed overall. On this zero-to-six scale where six is better, what you see is that there is a strong relationship between how well people segmented a completely different set of activities and how well they can perform this everyday task. So that suggests that the individual differences in the ability to construct these event representations in perception are related to their action performance. And just like the relationship between segmentation agreement and memory, this relationship can hold even after you control for dementia status. So in the healthy older adults, they're performing better and segmenting better; the very early stage AD participants are performing a little less well and segmenting a little less well. Even in this older group, you can see that a few people, the people who are zero, literally just look at the objects, maybe try a couple of things and then give up. These people have reached the stage at which they are not able to live independently because they can't do simple things like make lunch. But some of them who have been given that more advanced diagnosis are able to do the tasks, and those are the people who are able to segment the other activities. So, the ability to segment activity appropriately is associated with the ability to perform a different everyday activity above and beyond participants' dementia status.

I want to digress for just a moment to say something to those who might be interested in clinical assessment of cognitive performance. One of the things that we were concerned about when we got into this line of research is that the task of watching a movie and then just telling me what happened in that movie feels very loosely structured and very subject to idiosyncrasies of what the participant happens to be interested in and what their background knowledge happens to be. We were concerned about whether simply recalling a movie or recognizing pictures from it would be a sensitive measure of episodic memory, compared to the well-characterized and well-normed tests that we often used in the laboratory. On the other hand, if these kinds of memory measures are sensitive and reliable they can be really valuable in the clinic, because our memory-impaired participants really hate taking the laboratory memory tests. You bring someone in, who first came into the clinic because they have a memory complaint, and you say “I’m going to show you lists of words or pictures and ask you to memorize them.” We were talking about stereotype threat before. That’s an uncomfortable situation to be in, in which you are stressing me on the very cognitive function that I’m anxious about and where I know I don’t perform well. Whereas if we bring people and ask them to watch a movie and then tell us what happened, it’s a perfectly pleasant task. Our participants leave the lab happy. So, we are interested in how effective these measures are for diagnosis and differentiating groups compared to the laboratory measures. In this study, everybody was assessed with a standard neuropsychological battery. They take these tests at the clinic as part of participating in the ADRC. These are three standard clinical measures: the selective reminding test is a test in which people learn categorized lists of words and then have to first try to freely recall them and then recall them in response to category cues; the verbal paired associates test is a test where people learn pairs of words and they are given the first member of the pair and they have to recall the second; the logical memory test you’ve encountered—this is the most user-friendly of the three. Then we looked at their clinical dementia status and also at their genotype for the Apolipoprotein E (APOE) gene. This is a gene that codes for a protein that is involved in the metabolism of amyloid, and amyloid has been known for some time to be involved in AD. People who have the e4 allele of this gene have an increased risk for AD, and in sensitive memory tests, you can see differences even in people without a diagnosis of AD between those who have one or two copies of this allele and those who don’t. So we wanted to know which of these memory tests, could register these differences in their clinical dementia status and in their genotype.

The advantages of the neuropsychological measures are that they are well characterized and they are well controlled. But on the other hand, these event memory measures have some advantages themselves. They are easy on the patients. And, you know, we think if people are engaged and interested and the materials are rich, there is a potential to drive more variance in the system. Here is what you find: On the laboratory episodic memory test, as you would expect, the healthy older adults do best; those with more advanced dementia do worse. There is a small but significant difference which is most evident in the intermediate dementia group between those with the e4 allele and those without the e4 allele; the people without e4 allele are doing better. There are not equal numbers of e4 positive and negative people across these groups, because those with the e4 allele are more likely to have AD. But within each group, there are some people who have it and some people who don’t. If we look at the event memory measures, we see just about the same picture. And in fact, the effect size is just about the same. So [[the everyday event memory]] measures take about the same amount of time, maybe a little less time to give than [[the laboratory measures]]. They are more user-friendly, they don’t upset our participants, and they seem to have the same kind of sensitivity. So, the everyday event measures or event memory measures show very similar effects of aging dementia and they are a lot less aversive. The last topic I want to turn to today is the neural correlates of segmentation and memory in healthy aging. We haven’t yet looked at this in AD. In a recent study, Christopher Kurby and I (in press) tested a relatively large group (for an MRI study) of older adults and a comparison sample of younger adults. What we were interested in was how individual differences within the older adults in their behavior would relate to individual differences in the evoked responses in their brains while comprehending events. Everybody viewed movies of everyday activities in the scanner. They didn’t perform any task, they had never been informed of anything about event segmentation— they just watched these movies for comprehension. They segmented them into fine- and coarse-grained events outside the scanner. So after we record their brain activity, we introduce them to this task in which they segment the activity to mark boundaries between natural and meaningful event units. They perform those tasks on the movies, and then we test their recognition memory and their recall memory. For those who were here at the fifth lecture, you recall that there is a large set of brain areas that show phasic increases in activity around the time of event boundaries. The distribution across the brain of those responses in this sample is very much the same as what I showed you in the fifth lecture. You see large changes

in the back of the brain. This is viewing the left hemisphere from the side laterally and then medially, and then the right hemisphere laterally and medially. In the back of the brain at the juncture of the parietal, temporal, and occipital lobes, laterally and medially, you see a strong evoked response. And then you also see responses in lateral frontal cortex which are generally stronger on the right hemisphere. So, very much like what we saw before. If you look at the timecourse of response, what you see again is also quite similar to what we saw before. So recall that the fMRI response is time-lagged relative to the neural activity by five to ten seconds because it's a blood response. What we see here is activity that rises and peaks about five or six seconds after the point at which they had identified an event boundary. So again, I want to remind you that this is what was going on in their brain while they were initially viewing the activity, time-locked to the event boundaries that they gave us afterwards. What we see is, across these areas, a response that rises and peaks about five to ten seconds later and is larger for coarse-grained event boundaries than for fine-grained event boundaries. And the response is quite similar between older and younger adults. And in fact, we found no place in the brain in which there were significant differences in the response of the younger and older adults. Now, the primary objective of this study was not to detect a group difference between the younger and older adults, and so it's underpowered for that, so take that caveat. But this is actually a little bit surprising and I want to return to that point a little bit. The main question we wanted to ask in this study is: Are there individual differences within these older adults in their brain response that are related to their cognitive performance? The measure that turned out to be of most interest is a measure that we adapted from the work of Uri Hasson's lab (2004). I mentioned a study by Zadbood and his colleagues (2017) in the fifth lecture. The measure that we used here is very similar to that measure. What we're interested in is whether, as you're watching a movie over time, a given part of your brain is rising and falling in synchrony with the rising and falling of the brains of the other people who watched that movie. So it's kind of related to the behavioral event segmentation measure, where we are interested in whether people segment the activity at the same places. Here we are interested in whether the brain activity is going up and down at the same places across individuals. Here is where we leveraged having a separate younger adult group. We wanted to ask whether, as an older adult, having brain activity synchronized to the brain activity of the younger adults is associated with better cognitive performance. So you take one individual's time-course of brain activity for a given region of the brain. And then you take the same region of the brain, pull timecourses
out of all of the younger adults, and average those together and then compute the correlation between these. And then we do this for every voxel in the brain to create a correlation map. Across the brain, how strongly is that person's brain activity correlated with that of the younger adults? That's going to be our measure of younger adult-like brain activity. And then we can ask, Where in the brain is younger adult-like brain activity associated with better or worse cognitive performance? What you find is, replicating the work out of Hasson's lab (2004), there is a big swath of activity in the posterior parts of the brain that is most strongly synchronous across observers. We found three results where normative temporal dynamics—having rising and falling in your brain that's more like younger adults—is associated with better cognitive performance. People with higher segmentation agreement had more synchrony with the younger adults in the right posterior superior temporal sulcus. This is striking to me, because this is the area of the brain that is most activated at event boundaries. We also found that people with better hierarchical alignment had more correlated activity in the left dorsolateral prefrontal cortex. These dorsolateral prefrontal regions are associated with top-down control and with working memory maintenance. And these areas, the more anterior areas, are associated with longer-duration event representations. And then we found that in this area in the insular cortex, just tucked in behind Broca's area in the frontal lobes, people who identified longer events have less synchrony. In other words, people who had higher synchrony were segmenting into finer-grained events. I don't have a great interpretation of that result. Overall, older adults had brain responses to event boundaries that were very similar to those of younger adults. This is somewhat surprising. We were actually quite surprised that we did not see group differences between the younger and older adults. However, I want to note two caveats. The first is, of course, that this is really powered for the individual differences, not for the group differences. But the other is—and this really hit home by the time we finished this study—not everyone is eligible to be a participant in an MRI study. If you have metal implants anywhere in your body, those are dangerous, so we can't put you in the scanner. If you have vision that's not correctable with lenses, then you can't see our stimuli in the scanner. If you have a history of stroke or neurological disorder that compromises our ability to interpret the brain data, we usually exclude those people. There is a long list of exclusions. The older adults that are eligible to participate are a super healthy group of older adults. That may minimize our sensitivity to detect age differences.
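For concreteness, here is a minimal sketch, in Python with NumPy, of the kind of inter-subject correlation computation described above: one older adult's timecourse is correlated, voxel by voxel, with the average timecourse of the younger adults. The array layout and variable names are assumptions for illustration; the actual analysis involves additional preprocessing and statistical steps.

import numpy as np

def younger_adult_likeness(older_ts, younger_group_ts):
    # older_ts:          array of shape (n_timepoints, n_voxels) for one older adult
    # younger_group_ts:  array of shape (n_subjects, n_timepoints, n_voxels)
    #                    holding the younger adults' timecourses, aligned to the movie
    template = younger_group_ts.mean(axis=0)       # average younger-adult timecourse
    o = older_ts - older_ts.mean(axis=0)
    t = template - template.mean(axis=0)
    num = (o * t).sum(axis=0)
    den = np.sqrt((o ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    return num / den                               # one correlation value per voxel

# The resulting map for each older adult can then be related to behavior,
# for example by correlating its value in a region of interest with that
# person's segmentation agreement score across participants.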

But the thing that I think is quite exciting is that within the older group, better neural synchrony with younger adults was associated with better behavioral segmentation in that area that is strongly associated with segmentation. Before I leave, I just want to return to one of the results from yesterday to make a point about interventions to improve memory. It's one thing to be able to characterize the relationship between event memory and other kinds of memory in healthy aging and AD. The news has a few bright spots, but it's mostly a little bit concerning. The obvious million-dollar question is: What can we do about it? One thing that we have been focused on in the lab is cognitive interventions that can scaffold people's event comprehension. Yesterday I described this paradigm in which people watch movies of everyday activities that we've instrumented by pausing at event boundaries, pointing out salient objects and ringing a bell. And what I showed yesterday is that when we compared viewing of those movies to viewing of uninstrumented movies, these movies are better remembered. And I just want to remind you that the data in this experiment came from both younger and older adults. In this study, we coded memory for information at the boundaries, at midpoints, and other information. And for a third of the participants, we cued the boundaries; for a third of the participants, we cued the midpoints—which we thought ought to be harmful—and [[for]] the remaining third we just presented the original unedited movies. What we found is that compared to the unedited movies, cuing the event boundaries boosted memory for those boundaries. And this was just as powerful in the older adults as in the younger adults. Just to remind you, to our surprise, cuing the event middles didn't seem to mess people up. So this suggests one strategy: You could imagine training a computer model to segment my favorite TV shows for me and then to break them up into appropriate event units as I'm watching TV. It's a little bit funny but it might work. This is the kind of stuff that really impacts people's quality of life. People say, you know, the TV cuts to a commercial and when it comes back, I'm lost and I don't remember what was going on. And if we can address those kinds of things, that can really improve people's quality of life. To sum up, older adults perform less well than younger adults on a wide range of laboratory measures of cognition. I want to emphasize that, you know, we saw in that Levy and Langer (1994) study that this difference was smaller in some cultures than in others, particularly smaller here in China than in North America. So part of this may have to do with cultural expectations about aging and memory. That said, age differences in tasks where older adults can use event knowledge and event models are often smaller or absent, which
is encouraging. Measures of event segmentation predict elders’ memory and ability to perform everyday activities. And individual differences in the neural correlates of event comprehension predict behavioral segmentation. We’re really excited that it is looking like interventions that are effective in younger adults are also effective in older adults. I’ll leave off there and thank you.

Lecture 10

Event Representations from Cinema and Narrative Fiction

© Jeffrey M. Zacks, Reproduced with kind permission from the author by koninklijke brill nv, leiden, 2020 | doi:10.1163/9789004395169_011

All original audio-recordings and other supplementary material, such as any hand-outs and powerpoint presentations for the lecture series, have been made available online and are referenced via unique DOI numbers on the website www.figshare.com. They may be accessed via this QR code and the following dynamic link: https://doi.org/10.6084/m9.figshare.8982488.

Thank you so much for that kind introduction. I'm honored to be here at Peking University for this last lecture, and I'm very grateful to those who are joining us and to those who have stuck with us for these nine previous lectures. It's amazing to think that we are now coming to the end. Over the course of these lectures, I have been developing an account of how people perceive, comprehend, conceive, remember, and plan for events. In this last lecture, what I'd like to do is [[to]] apply these concepts to our comprehension of media—in particular, film and narrative text. Most of what I am going to say today ought to stand on its own; if you haven't been to the previous lectures, I think it should still make sense. I will try to back up and provide some background as needed, but hopefully most of it will stand on its own. For those of you who have been here for the previous lectures, I encourage you to try to relate the more theoretical concepts to these applications as we go forward. I want to start by noting something about our habits as humans consuming media. The first thing to note is we watch a lot of movies. In 2009, the last year I was able to get data for, 1.5 billion movie tickets were sold in the United States and Canada, and by that point streaming had already taken off, so that had already eaten into some of the ticket sales. It's not just humans. You can put your cat or your dog in front of the movie screen and they'll watch happily. And it's not just movies. Similarly, reading is one of these ubiquitous pastimes; people read for pleasure every chance they get in every environment. This raises a natural question: What evolutionary adaptations support our ability to understand film or to read? But of course, as soon as I ask the question, the obvious answer is "none". These functions are too recent in our evolutionary history to have been shaped by biological evolution. Books are
of course the older of the two, but written language is a relatively recent cultural product. Spoken language is an adaptation. Every healthy human will acquire spoken language, but reading takes special instruction and years of practice. Similarly, event comprehension is the ability to understand the auditory and visual events of our everyday environment. That’s an adaptation. That’s something we share with all humans and with our nonhuman cousins. But film comprehension has only been around for about 120 years, so our ability to comprehend film couldn’t possibly have been shaped substantially by biological evolution. So, how is it that we can make sense of these cultural products that are so complex and so particular in their structures? What I want to argue is that the particular structures that we see in narrative text and in film are by no means accidental. They are shaped by cultural transmission, but they are rooted in our biological adaptations. The basic idea is that animals evolved an event comprehension system, the system that we’ve described over the previous nine lectures, a long time ago. We have common ancestors going back and spanning the mammalian taxa, and probably other species that share a lot of the event comprehension system with us. And as I argued in the third and sixth lectures, we did this because it’s really, really valuable, in evolutionary terms, to be able to anticipate the near future consequences of our own actions and the development of the environment around us. So, animals evolved the ability to represent events and used that information to guide their predictions about what’s happening around them a long time ago. Some species, particularly in the primate lineage, extended the time-scale of these predictions a little bit farther to enable planning for the more distant future—offline planning of the sort that you need if you’re going to coordinate with your conspecifics on things like hunting and agriculture. According to this story, language evolved because it facilitated coordination amongst our conspecifics as a social species. If we’re going to engage in larger projects that require a group of people coordinating together, we’ve got to have a means of coordination and that’s the pressure that shapes language, and in particular, one great way to coordinate my activities with your activities is for us to have a communications medium that allows us to bring our event models into register with each other. That is what narrative language does and that is the same kind of structure that narrative film picks up on. One implication of this view is that our brains ought not care too much about whether the contents of an event model within them come from our real experience or from having heard about it or read about it or seen it in a movie. In other words, if there’s a herd of prey animals in the next canyon over, what I’m going to do with that information doesn’t really depend on whether

One implication of this view is that our brains ought not to care too much about whether the contents of an event model come from our real experience or from having heard about it, read about it, or seen it in a movie. In other words, if there's a herd of prey animals in the next canyon over, what I'm going to do with that information doesn't really depend on whether I saw them with my own eyes or my cousin told me about them. This suggests that our brains might not be optimized for tracking the source of our event representations in long-term memory, and consistent with that, it turns out that we actually have really poor memory as a species for the source of event information in our memories. You can see this vividly in studies of misinformation, where two sources give you conflicting information and you have to sort out which source provided which piece of information.

In the domain of film, Andrew Butler and his colleagues (2009) did a really neat study of this. They showed people movies that depicted historical events. This is a still picture from a film called The Last Samurai, which starred Tom Cruise. In this movie, Cruise plays a disaffected veteran of the wars between the United States government and Native Americans who moves to Japan to advise the Japanese emperor on putting down a samurai resistance. When they show a movie like this, they start off with "Based on a true story". But in this case, the movie was pretty loosely based on the true story. The actual advisers who went to advise the emperor were French, and the whole back story on which this movie is based, about the protagonist relating his experiences in the American West to his experiences with the samurai, is just made up from whole cloth.

What Butler and his colleagues (2009) did was have participants read a set of historically accurate essays about events such as the samurai rebellion. For some of those essays, participants also saw a Hollywood film excerpt that included distortions. They saw the film either before or after the essay. In some cases, the essay was accompanied by a warning, and there were two kinds of warnings. The general warning just said, "Look, filmmakers sometimes take liberties with the facts, and the accurate facts are in the essay that we're giving you. If there's a conflict between the essay and the film, go with the essay". In other cases, participants got highly specific warnings about what was going to be distorted. In the Tom Cruise case, they might get a warning that said, "The advisers in this film are going to be of the wrong nationality. So, watch out for that". Then, they took a memory test in which they were instructed to recall only facts that they learned in the essay.

Here's what they found. If people were not warned, they were quite likely to falsely endorse statements that reflected what was presented in the movie, compared to when they didn't see the movie clip. They had a pretty high rate of false memory, attributing information that came from the movie to the essay. In other words, they were poor in their ability to discriminate whether the information came from the text or from the less reliable source. It didn't matter whether they got the movie first or second; either way was bad. If they were given a general warning, it had no effect. Just telling people to be careful had no effect on their ability to discriminate material from the lousy source from material from the good source. If they were told exactly what misinformation was going to be there, they were able to track that, so there is some hope. In short, there were high rates of intrusion of misinformation from the movies, order didn't matter, and if a warning is going to be effective, it has to be specific. People are disposed to build event models, and perhaps a common mechanism supports their ability to build event models from real life and from media, such that they're not able to tell what information fed into the event model construction in a particular instance.

The next question I want to take up is: What are the constraints of our biological adaptation for constructing event representations that shape how media are constructed? What I want to argue is that we have a set of perceptual and conceptual mechanisms that are adapted for real-world event comprehension, and these basically determine how things like movies and narrative texts have got to be built if they're going to work well.

One way to think about this is in terms of the bandwidth of our sensory input. Consider this scene from the classic film Raiders of the Lost Ark. There's a lot going on. You've got objects moving. You've got terrifying chases. You've got all kinds of sound and motion. This is an older film that was made on analog film; the resolution of today's visuals and audio is, if anything, even higher. But we are going to leave our protagonist hanging in the air there and just consider the fate of one picture element, one pixel in that image, and think about how much information there is in a complex action movie like this one. To represent one pixel in that movie with adequate color fidelity requires about 24 bits—24 on-off switches of information. At the normal kind of video standard that Raiders would have been digitized at, there are 3.2 million bits in a single frame. (Recently that resolution has increased by a factor of about ten.) This film was shot at 24 frames per second and is 7000 seconds long, so that's 1.25 trillion bits of information. If each one of those bits were a piece of popcorn, they would wrap around the earth three times. And if anything, the bandwidth of our natural experience is greater than that of the movie: we have a wider field of view, we have haptic sensation, we have emotional reactions. So, the bandwidth of unmediated events is, if anything, higher than the bandwidth of mediated events.
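To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The 24-bit color depth, 24 frames per second, and 7000-second running time come from the lecture; the 640-by-480 frame size is an assumption of mine (a typical standard-definition digitization), so the per-frame figure depends on that choice, but the total lands at roughly the 1.25 trillion bits quoted above.

```python
# Back-of-the-envelope estimate of the information bandwidth of a feature film.
# Bit depth, frame rate, and running time are taken from the lecture;
# the frame resolution is an assumed standard-definition digitization,
# so the per-frame figure is illustrative rather than exact.

BITS_PER_PIXEL = 24          # 8 bits each for red, green, and blue
WIDTH, HEIGHT = 640, 480     # assumed standard-definition frame size
FRAME_RATE = 24              # frames per second, as in the lecture
RUN_TIME_S = 7000            # running time in seconds, as in the lecture

bits_per_frame = BITS_PER_PIXEL * WIDTH * HEIGHT
total_bits = bits_per_frame * FRAME_RATE * RUN_TIME_S

print(f"bits per frame: {bits_per_frame:,}")   # ~7.4 million under this assumption
print(f"bits per film:  {total_bits:,}")       # ~1.24 trillion, close to the lecture's figure
```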

How can our perceptual and cognitive systems cope with that kind of challenge? The psychologist Hugo Münsterberg (1916), writing about mediated events in the early days of film, wrote this:

the chaos of [the world around us] is organized into a real cosmos of experience by our selection of that which is significant and of consequence. This is true for life and stage [and, he went on to say, film] alike.

So, what is the selection that operates? How is it that our perceptual systems transmute the vast flow of sensory input into a manageable set of coherent event models, representations of events that relate to each other in a meaningful fashion?

I can illustrate some of the strategies the system takes with this short film. If you happen to have seen it before, please raise your hand while we're watching. This film was made by Transport for London to encourage people to watch out for cyclists—which seems like a live issue in Beijing! I don't know that it has much to do with looking out for cyclists, but I think it tells us a lot about the event comprehension system.

Why is it that humans are so insensitive to violations [changes in objects across cuts] such as we saw in this movie? In the film trade, they call these continuity errors. If you go on to movie fan sites, you can find tallies of continuity errors; I think they found about 400 in the film Titanic. That's pretty typical. Rabid fans can find these things if they look for them, but normal viewers basically miss them all.

I think we can describe this in terms of three basic mechanisms. One is filtering. A lot of information just never gets encoded in the first place. We have to be attending to information in order for it to be encoded in a way that's durable. You may not have noticed that there is a potted palm in the scene at all. If you never attended to it, then a change to the potted palm is not going to be noticeable to you. Second, I think the mechanisms of event segmentation that we have focused on throughout the week are utterly critical here. After each person gives their alibi, that is what filmmakers call a beat: the end of an element in the story. Notice that this film had no edits; it was all done with camera motion. So even without editing, the content of the events may lead you to update your working model of the current situation after each of those alibis is given, rendering previous information less accessible and determining the chunks for subsequent retrieval.

And a third mechanism that I think is really important is abstraction. Our event models do not represent all of the perceptual and sensory information at the highest level of detail. If they did, they would dramatically exceed our capacity to maintain a representation online through active maintenance. Instead, they function in part by activating links to information in our long-term semantic memory. If you encoded the roses on the table not at the subordinate level of roses but at the basic level of flowers, then if they change from one kind of flower to another you are not going to notice.

These three mechanisms of filtering, chunking or segmentation, and abstraction are critical to our comprehension of events, and they allow media to work the way they do. Media can guide these naturally existing mechanisms of schematization to produce particular effects that are desired by the author. I want to illustrate that schematization is such a ubiquitous component of writing and filmmaking that we don't even really notice that it's there. If I showed you a passage like this and asked you whether it is highly schematized, your first reaction would probably be to say, "No, this is very naturalistic prose". This is from a recent biography of Ulysses S. Grant by Ron Chernow, which I can highly recommend. It has this passage:

Grant asked Lincoln if he wanted to borrow his large bay horse Cincinnati, while he rode the small black pony Jeff Davis.

I'll just note that this is funny for two reasons. One is that this is in the middle of the Civil War and the pony is named after the Confederate president. The other is that Lincoln was a very tall man, and there are cartoons of him sitting on horses with his feet dangling on the ground, so it was very considerate of Grant to offer Lincoln the taller horse. The passage continues:

Still attired in black, Lincoln wore his trademark headgear, a tall silk hat, which was promptly knocked off by a tree branch as they galloped along. The weather was so dry the president was coated with dust when they reached the Union line.

There is nothing particularly unusual about that prose. If I were told that it is particularly schematic prose, my first reaction would be, "No, that's just normal narrative". Compare it with the following passage by the performance artist Kenneth Goldsmith. Goldsmith decided that he was going to wake up one morning and write down every movement his body made through the course of the day. This is the beginning of Kenneth Goldsmith's day, one day in the early 2000s:

Eyelids open. Tongue runs across upper lip moving from left side of mouth to right following arc of lip. Swallow. Jaws clench. Grind. Stretch. Swallow. Head lifts. Bent right arm brushes pillow into back of head. Arm straightens. Counterclockwise twist thrusts elbow toward ceiling. Tongue leaves interior of mouth, passing through teeth. Tongue slides back into mouth. Palm corkscrews. Thumb stretches. Forefingers wrap. Clench. Elbow bends. Thumb moves toward shoulder. Joint of thumb meets biceps. Elbow turns upward as knuckles of fist jam neck. Right hand clenches. Thumb rubs knuckles. Fist to right shoulder. Right elbow thrusts. Knuckles touch side of neck. Hands unfurl. Backs of hands press against flat of neck. Heels of hands push into jaw. Elbows raise. Finger wrap around neck. Thumbs tuck. Hands move toward jaw. Cover ears. Tips of fingers graze side of head. Hairs tickle tips as they pass. Thumbs trail behind fingers. Arms extend. Fingers unfurl….

He hasn't even gotten out of bed! It just goes on like this. So, this is one person's attempt to give a non-schematized description of everyday activity. And when you read that, you realize how much normal, mundane, everyday prose selects and shapes and lenses and filters our experience of events. Some of the ways it does this include: omission of lots and lots of boring details; elision—collapsing a long interval into a short bit that stands for it; feature selection—choosing which features to talk about; changing the point of view to give a view on the scene that's of interest; and segmentation into meaningful events.

We can see the same kind of schematization at work in filmmaking. This is a relatively typical cinematic depiction of an event sequence, a film that we made for an experiment in the lab. I deliberately chose it because it is, if anything, less schematizing than a typical commercial film. [Watching a video] It is a sequence of a woman making tea, and it has a few edits. It shows her preparing the tea, taking the tea out, and enjoying her cup of tea. When we show participants in the lab a sequence like this, they're perfectly happy to watch it. They treat it like a perfectly decent, if somewhat boring, movie of everyday activity.

Contrast this with what you might get if you really were just a fly on the wall watching that person's normal life. This is some of the unedited footage that we used to make the previous sequence. The first thing you notice is that there are these uncomfortable, boring gaps where nothing is happening for a long time. Eventually, the actor and the cameraperson give up and say, "Okay, we can cut this and just jump to when the tea is done". But notice also that we didn't get the intercut close-up shot of the mug, so we don't get as much information about what's going on, and it's not segmented as well.

So, just like narrative text, narrative film schematizes by omitting the boring bits and by shortening them; in this case we used a couple of fades, which have become a little unusual in filmmaking but which are a good way of indicating that time has passed. Feature selection allows you to zoom in on the things that you want the viewer to focus on. You can change the point of view so that you identify with one character or another, or take a third-person point of view on the scene. And again, film facilitates segmentation into a sequence of meaningful chunks.

So narrative text and narrative film use a lot of the same mechanisms to facilitate and shape our event-building mechanisms; nonetheless, there are some differences between the two. Consider this sequence from the fourth Harry Potter film, Harry Potter and the Goblet of Fire. Harry is competing with other students in a magic tournament where he has to solve a bunch of puzzles and defeat antagonists like this dragon, all while trying not to get singed to a crisp. We can contrast that video with this passage:

Harry watched the dragon nearest to them teeter dangerously on its back legs; its jaws were stretched wide in a suddenly silent howl; its nostrils were suddenly devoid of flame, though still smoking—then, very slowly, it fell—several tons of sinewy, scaly black dragon hit the ground with a thud that Harry could have sworn had made the trees behind him quake.

These two ways of depicting events give us really quite different information. They select different bits of information about the same event out in the world. Language is really good at telling people about internal states. That was a pretty perceptually rich linguistic description, with an unusual amount of color and shape information, but it also had information about what Harry saw and what he felt. Language can give evaluations: you can report that a character is afraid or planning or concerned. And it can give explicit temporal and spatial locatives, so you can say that time has passed or that the scene has changed. In film, by contrast, the amount of sensory information—visual and auditory sensory detail—is obviously much higher than in a text. It's easy to incorporate a large number of background features, whereas it would be really tedious to mention all of those features in a narrative; a novel that continuously articulated what was in the background, where the people were, and what the rocks looked like would just be a boring novel. And film gives a heavier rendering of point of view than a story does, because the camera is always located in a particular position. It is always giving you a point of view on the scene, whereas a narrative can remind you of point of view from time to time but then shift to a mode that provides more coarse-grained information about it.

In language, you can tell a lot more about characters' internal states. In film, you can tell a lot more about the perceptual and sensory world. This is captured in the maxim that actors are often given: show, don't tell. The only way you can tell an audience about a character's internal state without doing a voiceover is to have the person depict the emotion on their face, or to let the audience infer it from the surrounding scene.

One of the powerful ways that film in particular allows us to schematize events is by guiding the eyes, and I want to give one example of how film does this. It comes from my friend and colleague Tim Smith at Birkbeck College, who is well known to some of us. The example came from a really interesting event that Tim organized at the Academy of Motion Picture Arts and Sciences, the organization that does the Oscars. At this event they invited several neuroscientists: there's me, there's Uri Hasson, there's Talma Hendler, James Cutting, and Tim, and several film people. This is the actor and director Jon Favreau. This is the legendary film editor, director, and sound designer Walter Murch. This is the director Darren Aronofsky and his screenwriting partner Ari Handel. One of the films that Favreau had recently made at that point was an action film called Iron Man II. What Tim did was eye-track 10 or 12 of the audience members just before the event while they watched a scene from Iron Man II. This is Jon Favreau looking at the data and talking about what he is seeing. What you'll see is a clip from the movie with a heat map superimposed on it, showing the places where lots of people are looking over time. [Watching a video clip from Iron Man II]

I think that demonstrates really clearly that filmmakers have to have great intuitions about where it's effective to guide the eyes to facilitate the construction of the event model that they want people to build in their heads. Filmmakers have adapted their practices to the structure of our visual systems, auditory systems, and comprehension systems over a period of 120 years, so that there is an exquisite tuning. It's not that our visual systems evolved for film; rather, film has evolved to fit our visual systems, and the people who make films have become brilliant intuitive psychologists and neuroscientists.

One other technique that I would be remiss if I didn't mention is montage. Montage is what results when you abut two images together in time. This is a still-picture illustration of a famous informal experiment conducted by the Russian film scholar and filmmaker Kuleshov. The original film has been lost, but this reconstruction uses images that might have been like the ones he used. What he did was ask an actor to pose a neutral face, and then intercut the neutral face with a hot bowl of soup, then the neutral face, then a child in a coffin, then the neutral face, then a beautiful woman, then the neutral face. When people watched this sequence, they reported how brilliant it was that this actor could convey hunger, or sadness, or love, with just the slightest raise of an eyebrow or adjustment of his mouth. In fact, the actor wasn't conveying anything; it was the same shot of the actor every time. It's all in your mind as your mind knits together the successive shots. So, that's another really important way that film can shape our event model construction.

To summarize, narrative text and narrative film both depend on our evolved mechanisms for representing events. They differ somewhat in their interface to those mechanisms, but they scaffold event model construction in very similar ways. What I want to argue is that the result is the construction of a sequence of events in our minds that is much the same whether the source is either of these media or events that we experience in our natural lives.

I want to show you one more clip before we take a short break. This is an excerpt from the James Bond film Skyfall. I think it really nicely illustrates the kind of joy of event model construction that media give us. In a film like this, you can have all kinds of experiences that would be way too dangerous to have in real life. I'd love to chase bad guys through the London Underground, but I would probably get shot in about five seconds. So, I'll let James Bond do it for me. But the real question I want to ask you before we take our break is: how many edits do you think there were in that sequence? The sequence was 36 seconds long. Just take a guess. Those who think there were fewer than 5 cuts in that sequence, raise your hands. How about 5 to 10? 10 to 15? 15 to 20? More than 20? Some people are too conservative to make a response. You're a little bit better than most of my audiences. In fact there were exactly twenty cuts in that sequence, and few people thought there were that many. This is very typical; it is a phenomenon called edit blindness. (By the way, this sequence is quite typical for modern action films. There is an edit faster than once every two seconds, and the overall rate of editing across an hour-and-a-half action film these days is usually an edit every four seconds.) Most of the time, we don't notice most of those edits. In fact, if you ask people to attend to them and to respond whenever there is an edit, they will miss a good proportion of them. What I want to do after we take a little break is come back and talk about the mechanisms that produce that kind of edit blindness.

All right! Let's continue. So why is it that we can see a sequence like that and be so unaware of many of the edits? And more broadly, why is it that film editing can be so invisible? It seems like we can look through it as if it were transparent to the underlying events shown by the film. One of the first people in the history of film to appreciate this was the filmmaker Georges Méliès. He is the protagonist in the Martin Scorsese film that came out a few years ago, called Hugo, which is a wonderful, charming film. I highly recommend it if you haven't seen it.

This is an excerpt from one of Méliès's films, made in 1902, and it contains one of the earliest cuts in the history of cinema. It's called A Trip to the Moon. There, that was the cut. Méliès's interest in cuts came about because he was a magician. He got into filmmaking from doing stage magic, and the reason he got so interested in film was that, for him, magic was about giving people perceptual experiences beyond what they could have in real life, experiences that would be special. When he realized that you could stop the camera, move stuff around, and then restart the camera, causing things to transform instantaneously, he just thought this was the coolest thing ever. Then he and others quickly realized that you could use it to shape a story, which is what he's doing here.

Filmmakers quickly realized that people were pretty insensitive to cuts. There is a sequence in the film Hugo depicting one of the great legends of early film: that when the Lumière brothers showed audiences in Paris their film of a train coming into La Ciotat station, the audience ran screaming from the theater because they thought the looming train was going to run them over. That probably never happened, but there are accounts of people noticing how unnatural things like looming and other aspects of film seemed. You don't, however, see contemporary writers reporting that audiences had a lot of trouble understanding film cuts. In fact, there has been some recent work showing film for the first time to isolated groups of people who have never seen a screen before, and there is a lot of film editing that they can get without any experience. So, it appears that many of the techniques that film editors arrived at are cognitively natural, though they have evolved over time.

Why are we so insensitive to cuts? At first it seems like a paradox, and this is really important, so I want to emphasize it: until those first cuts of Méliès and his colleagues, it had never been the case in the 400-million-year history of the visual system that everything in front of an eyeball was instantaneously replaced with something totally different. That was a brand new thing in our evolutionary history, and it is in some regards quite surprising that people could cope with it so easily. I want to argue that the reason it works is that we evolved for vision in the real world, where that kind of instantaneous displacement doesn't happen—but despite the fact that objects in the real world don't blink in and out of existence, our visual input is jumpy and interrupted, because we move our eyes constantly, we blink, and things get hidden from our view. If I am standing behind this podium, you can't see my legs, and if I walk out from behind the podium, then you can. Our visual system has to track the fact that things are being hidden and revealed over time; this happens continuously.

Also important, and something we will talk about in a second, is masking. I just flashed three objects; let me ask, was there a red diamond among the three objects that I flashed? How many would say yes? I'll give you one more shot at it: was there a green circle? I made it too easy because I clumsily went back and forth, but if I had just told you to pay attention and flashed the green circle, most of you would have been able to recognize it. Whereas if I do something like this and ask you, "Was there a yellow parallelogram?", it is much harder when the display is followed by something that has high contrast and motion. One reason that cuts work is that they're often followed by motion sequences that mask the disruption of the cut itself. So masking matters.

The other thing that really matters is eye movements. Why are eye movements so important? We move our eyes, on average, three to four times every second, and most of those movements are jumpy, ballistic movements called saccades. In a saccade, the eye is fixed on one object and then launches to another object, and while the eye is moving, the visual information coming into the eyeball is useless because the eye is moving too quickly; our brain has a nifty mechanism to shut off input from the eyeball at those points. We have to move our eyes because we have a foveated visual system. This is a picture of the sort that one might get at an eye doctor's; they take these pictures to look at the blood vessels of the retina and to make sure that the retina is doing okay. What you can see in this kind of picture is that there is a focal spot with a much higher concentration of photoreceptors: almost all of the cones in your retina are concentrated there, in the fovea. As Jon Favreau said, this part of your visual field covers only about the width of two thumbs held at arm's length. It's tiny. Try this: hold out your thumbs, focus your eyes on them, and then try to pay attention to someone sitting off to the side, say 45 degrees away. You'll notice that you actually have relatively little information about them; it's quite blurry. The fovea provides detailed information about only a tiny portion of the visual field, and that means our eyes have to jump around, making these saccadic eye movements, in order to pick up the information that might be of interest. On top of that, we blink every two to three seconds, and so between saccades and blinks, if you add up the time that our brain is basically disconnected from our eyes, it comes to about a third of our waking lives. I find that totally creepy—that we're basically functionally blind for about a third of our waking lives—but there you have it.
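As a rough sanity check on that "third of our waking lives" figure, here is a small Python sketch. The saccade and blink rates come from the lecture; the per-event durations of suppressed vision (about 50 ms per saccade and about 300 ms per blink) are typical values I am supplying for illustration, so treat the result as an order-of-magnitude estimate rather than a measurement.

```python
# Order-of-magnitude check on the fraction of waking life with suppressed vision.
# Rates are from the lecture; per-event suppression durations are assumed values.

SACCADES_PER_SEC = 3.5        # lecture: three to four saccades per second
SACCADE_SUPPRESSION_S = 0.05  # assumed ~50 ms of unusable input per saccade
BLINK_EVERY_S = 2.5           # lecture: a blink every two to three seconds
BLINK_SUPPRESSION_S = 0.30    # assumed ~300 ms of suppression per blink

saccade_fraction = SACCADES_PER_SEC * SACCADE_SUPPRESSION_S   # ~0.18
blink_fraction = (1.0 / BLINK_EVERY_S) * BLINK_SUPPRESSION_S  # ~0.12

total = saccade_fraction + blink_fraction
print(f"saccades: {saccade_fraction:.0%}, blinks: {blink_fraction:.0%}, "
      f"total: {total:.0%}")  # roughly 30%, i.e. about a third
```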

Okay, so our visual system adapted in a world where objects weren't going away, but where they were being removed from view three times a second or so, due to these eye movements and blinks. The architecture of visual processing reflects these constraints. This is a diagram illustrating the information flow in the visual parts of the brain. What we're looking at is a brain viewed from the left and a little bit behind. Here's the cerebellum, here's the back of the brain, and here's the front of the brain. The primary visual cortex is mostly buried here in the sulcus, right at the back of the occipital lobe. In general, information coming from the eyeballs is relayed through the lateral geniculate nucleus. It first hits the primary visual cortex and is then relayed forward through a series of synapses, moving forward and out, inferiorly and superiorly, through a succession of visual processing stages. We can call this feedforward information flow, and we can characterize each visual region in terms of how many synapses it is away from the retina. However, we find that, in most cases, each feedforward connection is paired with a massive feedback connection that is usually greater, in terms of number of projections, than the feedforward connection. So the brain is talking back to the earlier stages at least as much as the earlier stages are feeding information to the later stages.

What we have been thinking is that this bottom-up and top-down information flow is coordinated so as to allow us to construct coherent event models and to update them in the face of a jumpy, interrupted visual world. Within an ongoing event, we want our brains to build coherent event models that bridge the gaps that occur when we move our eyes. We want the event model we are building to be stable as we're moving our eyes and blinking and as things are becoming occluded. That is important for it to be useful: if an object ceased to be available to our consciousness each time it was occluded or each time I blinked, that would be bad. On the other hand, when one meaningful event has ended and another has begun—for example, when we finish this lecture and we all walk out the door—keeping information about the location of the podium and the chairs and so forth in our brains is no longer going to be helpful, and we would want to update so as not to group together information across those boundaries. In particular, if there are other blue chairs outside or in the next room, we would not want to risk joining together a representation of these blue chairs with those blue chairs. As we saw in Lecture Eight, at such points we might want to update our event models and not bridge information across those gaps.

What would a neural system look like that bridges across visual cohesion gaps like this? What you would expect to see is a system that has to work harder when there are film cuts, because it has to do this bridging, but that perhaps doesn't have to work as hard for cuts that correspond to event boundaries, because higher levels of the nervous system might be telling it, "Look, you can take a break here. We don't want to connect the information from before the boundary with the information from after the boundary." This is a hypothesis about neural mechanisms that evolved for natural vision and that might be coopted to enable film editing.

My colleague Joe Magliano and I (2012) tested it with neuroimaging data from the film The Red Balloon. Those who have been here for the previous lectures have heard other analyses of these data; what I'm going to do now is tell you how we used them to look at film editing in the brain. What we did was divide the five-second intervals in this movie into four classes. First, there were intervals in which there was no cut. (I should say, The Red Balloon is an older film and it's edited in a very distinctive style, so it actually has a much lower frequency of edits than something like Skyfall, and that's really helpful for doing these analyses; there are a lot of five-second intervals that have no cuts.) Second, there were intervals containing cuts that a filmmaker would call continuity edits. A continuity edit is an edit within a coherent spatiotemporal framework, part of a larger meaningful event sequence, such that the filmmaker intends for the audience to join information across the cut. These are the cuts that are often invisible to us; these are the cuts that people miss. Third, there were edits that were spatiotemporal discontinuities: edits where the spatial location or the time changes. And fourth, there were edits that filmmakers would call action discontinuities or scene breaks. This is when the protagonist comes to the end of a meaningful event sequence; you usually have a spatiotemporal discontinuity, cutting to a different time or location, but you also have a new action sequence beginning.

What I am plotting here is what happened when people were asked to make a behavioral judgment, to just tell us whether they perceived that a new meaningful event had begun. Experiencing a cut had no effect on their judgment that a new event had begun; they were no more likely to say that it was a new event if there was a cut. In fact, even if the action changes location, say you cut from one room to another, tracking a set of characters as they are running across Paris, that by itself doesn't produce a judgment that it's a new event. But if the characters go from running across Paris to jumping up for a balloon, that action discontinuity makes for a new event. Behaviorally, those action discontinuities are what people find salient. But a continuity edit is just as much of a visual disruption as a scene break, right? When you have a continuity edit, every point in the visual image is changing just as much as for a scene break.
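To make the interval coding concrete, here is a minimal hypothetical sketch of the classification logic in Python. It is not the coding scheme from the published study; it simply restates the four categories just described as a decision rule, assuming each five-second interval has already been annotated for whether it contains a cut, whether place or time changes, and whether a new action sequence begins.

```python
# Hypothetical sketch of classifying five-second film intervals into the four
# categories described above. An illustration of the logic, not the actual
# coding scheme used in the published analysis.

def classify_interval(has_cut: bool, changes_place_or_time: bool,
                      new_action_begins: bool) -> str:
    """Assign one of the four interval classes from the lecture."""
    if not has_cut:
        return "no cut"
    if new_action_begins:
        return "action discontinuity (scene break)"
    if changes_place_or_time:
        return "spatiotemporal discontinuity"
    return "continuity edit"

# Example: a cut that stays within the same place, time, and ongoing action.
print(classify_interval(has_cut=True, changes_place_or_time=False,
                        new_action_begins=False))  # -> "continuity edit"
```

Given such labels, the prediction described next amounts to a non-monotonic pattern across classes: higher activity for continuity edits than for no-cut intervals, but no comparable increase for cuts that coincide with event boundaries.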

The question is: What is going on in the brain at those continuity edits? What we wanted to test is the hypothesis that there are parts of the brain involved in bridging visual information, such that, compared to cases where there is no cut, they would be more active when there is a continuity edit and you have to bridge, but they would not be more active for edits where there is no need to bridge because a new event has begun. So this is a very specific prediction of a non-monotonic relationship, where you would have an increase [for continuity edits] and then a decrease back to baseline for these cuts [spatiotemporal discontinuities].

What I am showing here is a plot of what lights up in your brain when you see an edit in a film. We are looking at a medial view of the cortex. Unsurprisingly, primary visual cortex is highly activated when you see a cut. This makes sense, because the cells in our visual system are being stimulated by the change associated with the edit. You also see activity in these higher-level visual areas at the same points. We hypothesized that the early visual areas would be activated for all the cuts, whether they are event boundaries or not, but that perhaps some of these higher-level visual areas, basically ringing the primary visual cortex, would be the areas doing the extra work to bridge across the gaps, work they don't have to do when the edits correspond to event boundaries. Here in red, superimposed on the previous activity, are areas that are selectively activated at within-scene cuts but are not activated at cuts that correspond to event boundaries. As predicted, the early visual cortex doesn't show this pattern, but these higher-level visual areas show a pattern of being more activated at within-scene cuts—the very kind that are invisible—than at between-scene cuts. These parts of the brain, at the very moments of movie editing that our conscious awareness is missing, are, we think, scrambling around behind the scenes to knit together successive views. They do this not because it's film; they do it because this is what they do all the time when we're living our real lives. So, cuts leverage mechanisms that we evolved to deal with a jumpy, patchy visual world, mechanisms by which we stitch together successive glimpses into coherent event models. And this depends on top-down regulation governing those higher-level visual areas.

More broadly, what I want to say is that when we go into the theater or watch something on a streaming service, we bring the brains that we evolved for getting around in the real world. People didn't evolve for movies or for stories; movies evolved for people, so well that a patch of light dancing on a screen can tell us a tremendous amount about how our brains process real-life events.

Stepping back, here is what I hope we have accomplished this week. I have tried to establish the significant place of events within language and cognition; to articulate how events are individuated, updated, stored, and retrieved in perception and language; and to describe two domains in which event cognition has promising applications. This morning we talked about the application to aging and to clinical disorders such as Alzheimer's disease, and this afternoon we have talked about the application of event cognition to media.

I would like to conclude there, but first I would like to thank once again Professor Li and the group for being such gracious hosts. I would also like to thank my collaborators, some of whom are named here, and the funders who supported this research over the years. And with that I'll wrap up. Thanks.

Okay, that's an event boundary!

Bibliography

Altmann, G., & Kamide, Y. 2007. The real-time mediation of visual attention by language and world knowledge: Linking anticipatory (and other) eye movements to linguistic processing. Journal of Memory and Language 57(4), 502–518.
Baddeley, A. 2000. The episodic buffer: A new component of working memory? Trends in Cognitive Sciences 4(11), 417–423.
Bailey, H. R., Kurby, C. A., Giovannetti, T., & Zacks, J. M. 2013. Action perception predicts action performance. Neuropsychologia 51(11), 2294–2304.
Bailey, H. R., Kurby, C. A., Sargent, J. Q., & Zacks, J. M. 2017. Attentional focus affects how events are segmented and updated in narrative reading. Memory & Cognition 45(6), 940–955.
Bailey, H. R., Sargent, J. Q., Flores, S., Nowotny, P., Goate, A., & Zacks, J. M. 2015. APOE ε4 genotype predicts memory for everyday activities. Aging, Neuropsychology, and Cognition 22(6), 639–666.
Bailey, H. R., Zacks, J. M., Hambrick, D. Z., Zacks, R. T., Head, D., Kurby, C. A., & Sargent, J. Q. 2013. Medial temporal lobe volume predicts elders' everyday memory. Psychological Science 24(7), 1113–1122.
Baldassano, C., Chen, J., Zadbood, A., Pillow, J. W., Hasson, U., & Norman, K. A. 2017. Discovering event structure in continuous narrative perception and memory. Neuron 95(3), 709–721.
Bar, M., Kassam, K. S., Ghuman, A. S., Boshyan, J., Schmid, A. M., Dale, A. M., … Halgren, E. 2006. Top-down facilitation of visual recognition. Proceedings of the National Academy of Sciences of the United States of America 103(2), 449–454.
Barker, R. G. & Wright, H. F. 1966. One Boy's Day: A Specimen Record of Behavior. Hamden, CT: Archon Books.
Barker, R. G. & Wright, H. F. 1954. Midwest and Its Children: The Psychological Ecology of An American Town. Evanston, Illinois: Row, Peterson and Company.
Barsalou, L. W. 2008. Grounded cognition. Annual Review of Psychology 59, 617–645.
Ben-Yakov, A., Eshel, N. & Dudai, Y. 2013. Hippocampal immediate poststimulus activity in the encoding of consecutive naturalistic episodes. Journal of Experimental Psychology: General 142(4), 1255.
Benton, A. L. 1962. The visual retention test as a constructional praxis task. Stereotactic and Functional Neurosurgery 22(2), 141–155.
Biederman, I. 1987. Recognition-by-components: A theory of human image understanding. Psychological Review 94(2), 115–147.
Bower, G. H., Black, J. B. & Turner, T. J. 1979. Scripts in memory for text. Cognitive Psychology 11, 177–220.

Boynton, G. M., Engel, S. A., Glover, G. H. & Heeger, D. J. 1996. Linear systems analysis of functional magnetic resonance imaging in human V1. Journal of Neuroscience 16(13), 4207–4221.
Bransford, J. D., Barclay, J. R. & Franks, J. J. 1972. Sentence memory: A constructive versus interpretive approach. Cognitive Psychology 3(2), 193–209.
Bransford, J. D. & Johnson, M. K. 1972. Contextual prerequisites for understanding: Some investigations of comprehension and recall. Journal of Verbal Learning & Verbal Behavior 11(6), 717–726.
Butler, A. C., Zaromb, F. M., Lyle, K. B. & Roediger, H. L., III. 2009. Using popular films to enhance classroom learning: The good, the bad, and the interesting. Psychological Science 20(9), 1161–1168.
Campbell, K. L., Hasher, L., & Thomas, R. C. 2010. Hyper-binding: a unique age effect. Psychological Science 21(3), 399–405.
Castiello, U. 2005. The neuroscience of grasping. Nature Reviews Neuroscience 6(9), 726–736.
Chernow, R. 2017. Grant. Penguin.
Danto, A. 1963. What we can do. Journal of Philosophy 60, 435–445.
Defina, R. 2016. Do serial verb constructions describe single events?: A study of cospeech gestures in Avatime. Language 92(4), 890–910.
Defina, R. 2016. Events in Language and Thought: The Case of Serial Verb Constructions in Avatime. Radboud University Nijmegen.
Dougherty, R. F., Koch, V. M., Brewer, A. A., Fischer, B., Modersitzki, J., & Wandell, B. A. 2003. Visual field representations and locations of visual areas V1/2/3 in human visual cortex. Journal of Vision 3(10), 1.
Eisenberg, M. L. & Zacks, J. M. Under review. Predictive looking is impaired at event boundaries.
Epstein, R., Stanley, D., Harris, A. & Kanwisher, N. 2000. The parahippocampal place area: perception, encoding, or memory retrieval. Neuron 23(1), 115–125.
Ericsson, K. A., Chase, W. G. & Faloon, S. 1980. Acquisition of a memory skill. Science 208(4448), 1181–1182.
Ericsson, K. A. & Kintsch, W. 1995. Long-term working memory. Psychological Review 102(2), 211–245.
Ezzyat, Y. & Davachi, L. 2011. What constitutes an episode in episodic memory? Psychological Science 22(2), 243–252.
Ferstl, E. C., Neumann, J., Bogler, C. & Von Cramon, D. Y. 2008. The extended language network: a meta-analysis of neuroimaging studies on text comprehension. Human Brain Mapping 29(5), 581–593.
Flores, S., Bailey, H. R., Eisenberg, M. L. & Zacks, J. M. 2017. Event segmentation improves event memory up to one month later. Journal of Experimental Psychology: Learning, Memory, and Cognition 43(8), 1183–1202.

Franklin, N., Tversky, B. & Coon, V. 1992. Switching points of view in spatial mental models. Memory and Cognition 20, 507–518.
Gernsbacher, M. A. 1990. Language Comprehension as Structure Building. Psychology Press.
Gibson, J. J. 1979. The Ecological Approach to Visual Perception. Boston: Houghton Mifflin.
Gold, D. A., Zacks, J. M. & Flores, S. 2016. Effects of cues to event segmentation on subsequent memory. Cognitive Research: Principles and Implications, 2.
Goldsmith, K. 1994. Fidget. Coach House Books.
Grober, E., Buschke, H., Crystal, H., Bang, S. & Dresner, R. 1988. Screening for dementia by memory testing. Neurology 38(6), 900.
Hafri, A., Papafragou, A. & Trueswell, J. C. 2012. Getting the gist of events: Recognition of two-participant actions from brief displays. Journal of Experimental Psychology: General 142(3), 880–905.
Hard, B. M., Recchia, G. & Tversky, B. 2011. The shape of action. Journal of Experimental Psychology: General 140(4), 586–604.
Hasson, U., Nir, Y., Levy, I., Fuhrmann, G. & Malach, R. 2004. Intersubject synchronization of cortical activity during natural vision. Science 303(5664), 1634–1640.
Hayhoe, M., & Ballard, D. 2005. Eye movements in natural behavior. Trends in Cognitive Sciences 9(4), 188–194.
Hollingworth, A. & Henderson, J. M. 1998. Does consistent scene context facilitate object perception? Journal of Experimental Psychology: General 127(4), 398–415.
James, W. 1890. The Principles of Psychology. New York: Henry Holt.
Johnson, D. K., Storandt, M. & Balota, D. A. 2003. Discourse analysis of logical memory recall in normal aging and in dementia of the Alzheimer type. Neuropsychology 17(1), 82–92.
Jung-Beeman, M. 2005. Bilateral brain processes for comprehending natural language. Trends in Cognitive Sciences 9(11), 512–518.
Kintsch, W., Welsch, D., Schmalhofer, F. & Zimny, S. 1990. Sentence memory: A theoretical analysis. Journal of Memory and Language 29(2), 133–159.
Knoeferle, P., Habets, B., Crocker, M. & Munte, T. 2008. Visual scenes trigger immediate syntactic reanalysis: Evidence from ERPs during situated spoken comprehension. Cerebral Cortex 18(4), 789–795.
Kurby, C. A. & Zacks, J. M. 2008. Segmentation in the perception and memory of events. Trends in Cognitive Sciences 12(2), 72–79.
Kurby, C. A. & Zacks, J. M. 2011. Age differences in the perception of hierarchical structure in events. Memory & Cognition 39(1), 75–91.
Kurby, C. A. & Zacks, J. M. 2012. Starting from scratch and building brick by brick in comprehension. Memory & Cognition 40(5), 812–826.

Kurby, C. A. & Zacks, J. M. 2013. The activation of modality-specific representations during discourse processing. Brain and Language 126(3), 338–349.
Kurby, C. A. & Zacks, J. M. In press. Preserved neural event segmentation in healthy older adults. Psychology & Aging.
Kutas, M. & Hillyard, S. A. 1980. Reading senseless sentences: Brain potentials reflect semantic incongruity. Science 207(4427), 203–205.
Levy, B. & Langer, E. 1994. Aging free from negative stereotypes: Successful memory in China and among the American deaf. Journal of Personality and Social Psychology 66(6), 989–997.
Magliano, J., Kopp, K., McNerney, M. W., Radvansky, G. A. & Zacks, J. M. 2012. Aging and perceived event structure as a function of modality. Aging, Neuropsychology, and Cognition 19(1–2), 264–282.
Magliano, J. P. & Zacks, J. M. 2011. The impact of continuity editing in narrative film on event segmentation. Cognitive Science 35(8), 1489–1517.
Maguire, M. J., Brumberg, J., Ennis, M., & Shipley, T. F. 2011. Similarities in object and event segmentation: A geometric approach to event path segmentation. Spatial Cognition & Computation 11(3), 254–279.
Marr, D., & Nishihara, H. K. 1978. Representation and recognition of the spatial organization of three-dimensional shapes. Proceedings of the Royal Society, London, B, 200, 269–294.
McDaniel, M. A., Schmalhofer, F., & Keefe, D. E. 2001. What is minimal about predictive inferences? Psychonomic Bulletin & Review 8, 840–846.
McKoon, G., & Ratcliff, R. 1992. Inference during reading. Psychological Review 99(3), 440–466.
Morrow, D. G., Bower, G. H., & Greenspan, S. L. 1989. Updating situation models during narrative comprehension. Journal of Memory and Language 28(3), 292–312.
Morrow, D. G., Greenspan, S. L., & Bower, G. H. 1987. Accessibility and situation models in narrative comprehension. Journal of Memory and Language 26(2), 165–187.
Münsterberg, H. 1916. The Photoplay: A Psychological Study. New York/London: D. Appleton.
Newtson, D. & Engquist, G. 1976. The perceptual organization of ongoing behavior. Journal of Experimental Social Psychology 12, 436–450.
Osterhout, L. & Holcomb, P. J. 1992. Event-related brain potentials elicited by syntactic anomaly. Journal of Memory and Language 31(6), 785–806.
Park, D. C., Lautenschlager, G., Hedden, T., Davidson, N. S., Smith, A. D., & Smith, P. K. 2002. Models of visuospatial and verbal memory across the adult life span. Psychology and Aging 17(2), 299.
Pérez, A., Cain, K., Castellanos, M. C. & Bajo, T. 2015. Inferential revision in narrative texts: An ERP study. Memory & Cognition 43(8), 1105–1135.
Perry, J. & Barwise, J. 1983. Situations and Attitudes. MIT Press.

Pettijohn, K. A. & Radvansky, G. A. 2016. Narrative event boundaries, reading times, and expectation. Memory & Cognition 44(7), 1064–1075.
Pylyshyn, Z. W. 1973. What the mind's eye tells the mind's brain: A critique of mental imagery. Psychological Bulletin 80(1), 1.
Radvansky, G. A., Andrea, E. O. & Fisher, J. S. 2017. Event models and the fan effect. Memory & Cognition, 1–17.
Radvansky, G. & Copeland, D. 2006. Walking through doorways causes forgetting: situation models and experienced space. Memory & Cognition 34(5), 1150–1156.
Radvansky, G. A., Krawietz, S. & Tamplin, A. 2011. Walking through doorways causes forgetting: Further explorations. The Quarterly Journal of Experimental Psychology, 1–14.
Radvansky, G. A. & Zacks, J. M. 2014. Event Cognition. New York: Oxford University Press.
Radvansky, G. A., Zacks, R. T. & Hasher, L. 1996. Fact retrieval in younger and older adults: The role of mental models. Psychology and Aging 11(2), 258–271.
Rey, A. 1964. L'examen clinique en psychologie. Paris: Presses Universitaires de France.
Reynolds, J. R., Zacks, J. M. & Braver, T. S. 2007. A computational model of event segmentation from perceptual prediction. Cognitive Science 31, 613–643.
Rinck, M. & Bower, G. 2000. Temporal and spatial distance in situation models. Memory and Cognition 28(8), 1310–1320.
Rinck, M. & Weber, U. 2003. Who when where: An experimental test of the event-indexing model. Memory and Cognition 31(8), 1284–1292.
Robertson, D. A., Gernsbacher, M. A., Guidotti, S. J., Robertson, R. R., Irwin, W., Mock, B. J. & Campana, M. E. 2000. Functional neuroanatomy of the cognitive process of mapping during discourse comprehension. Psychological Science 11(3), 255–260.
Rosch, E., Mervis, C., Gray, W. D., Johnson, D. M. & Boyes-Braem, P. 1976. Basic objects in natural categories. Cognitive Psychology 8(3), 382–439.
Rosen, V., Caplan, L., Sheesley, L., Rodriguez, R., & Grafman, J. 2003. An examination of daily activities and their scripts across the adult lifespan. Behavioral Research Methods & Computers 35(1), 32–48.
Rowling, J. K. 2000. Harry Potter and the Goblet of Fire. London: Bloomsbury.
Sargent, J. Q., Zacks, J. M., Hambrick, D. Z., Zacks, R. T., Kurby, C. A., Bailey, H. R. … & Beck, T. M. 2013. Event segmentation ability uniquely predicts event memory. Cognition 129(2), 241–255.
Schwan, S., Garsoffky, B. & Hesse, F. W. 2000. Do film cuts facilitate the perceptual and cognitive organization of activity sequences? Memory & Cognition 28(2), 214–223.
Schwartz, M. F., Segal, M., Veramonti, T., Ferraro, M. & Buxbaum, L. J. 2002. The Naturalistic Action Test: A standardised assessment for everyday action impairment. Neuropsychological Rehabilitation 12(4), 311–339.

Shelton, A. L. & Gabrieli, J. D. 2002. Neural correlates of encoding space from route and survey perspectives. Journal of Neuroscience 22(7), 2711–2717.
Shepard, R. N., Cooper, L. A., et al. 1982. Mental Images and Their Transformations. Cambridge, MA: MIT Press.
Shimamura, A. P., Berry, J. M., Mangels, J. A., Rusting, C. L. & Jurica, P. J. 1995. Memory and cognitive abilities in university professors: Evidence for successful aging. Psychological Science, 271–277.
Shipley, T. F. 2008. An invitation to an event. In T. F. Shipley & J. M. Zacks (Eds.), Understanding Events: From Perception to Action (pp. 3–30). New York: Oxford University Press.
Speer, N. K., Reynolds, J. R., Swallow, K. M. & Zacks, J. M. 2009. Reading stories activates neural representations of perceptual and motor experiences. Psychological Science 20, 989–999.
Speer, N. K. & Zacks, J. M. 2005. Temporal changes as event boundaries: Processing and memory consequences of narrative time shifts. Journal of Memory and Language 53, 125–140.
Speer, N. K., Zacks, J. M. & Reynolds, J. R. 2007. Human brain activity time-locked to narrative event boundaries. Psychological Science 18(5), 449–455.
Sutton, S., Braren, M., Zubin, J. & John, E. R. 1965. Evoked-potential correlates of stimulus uncertainty. Science 150(3700), 1187–1188.
Swallow, K. M., Barch, D. M., Head, D., Maley, C. J., Holder, D., & Zacks, J. M. 2011. Changes in events alter how people remember recent information. Journal of Cognitive Neuroscience 23(5), 1052–1064.
Swallow, K. M., Zacks, J. M. & Abrams, R. A. 2009. Event boundaries in perception affect memory encoding and updating. Journal of Experimental Psychology: General 138(2), 236–257.
Talmy, L. 1988. Force dynamics in language and cognition. Cognitive Science 12(1), 49–100.
Thompson, W. 2010. China's rapidly aging population. Today's Research on Aging: Program and Policy Implications 20, 1–5.
Tootell, R., Silverman, M., Switkes, E. & De Valois, R. 1982. Deoxyglucose analysis of retinotopic organization in primate striate cortex. Science 218(4575), 902–904.
van Berkum, J. J. A., Hagoort, P. & Brown, C. M. 1999. Semantic integration in sentences and discourse: Evidence from the N400. Journal of Cognitive Neuroscience 11(6), 657–671.
van Dijk, T. A. & Kintsch, W. 1983. Strategies of Discourse Comprehension. New York: Academic Press.
von Helmholtz, H. 1910. Handbuch der Physiologischen Optik. Translated by J. P. C. Southall (1925), Helmholtz's Treatise on Physiological Optics, vol. 3.
Wechsler, D. & Stone, C. P. 1973. Instruction Manual for the Wechsler Memory Scale. New York, NY: The Psychological Corporation.

Whitney, C., Huber, W., Klann, J., Weis, S., Krach, S. & Kircher, T. 2009. Neural correlates of narrative shifts during auditory story comprehension. Neuroimage 47(1), 360–366.
Williams, H. L., Conway, M. A. & Baddeley, A. D. 2008. The boundaries of episodic memories. In T. F. Shipley & J. M. Zacks (Eds.), Understanding Events: From Perception to Action (pp. 589–616). New York: Oxford University Press.
Wolff, P., Jeon, G.-H. & Li, Y. 2009. Causers in English, Korean, and Chinese and the individuation of events. Language and Cognition 1(2), 167–196.
Yarkoni, T., Speer, N. K. & Zacks, J. M. 2008. Neural substrates of narrative comprehension and memory. Neuroimage 41(4), 1408–1425.
Zacks, J. M. 2004. Using movement and intentions to understand simple events. Cognitive Science 28(6), 979–1008.
Zacks, J. M. & Ferstl, E. C. 2015. Discourse comprehension. In G. Hickok & S. L. Small (Eds.), Neurobiology of Language (pp. 662–674). Amsterdam: Elsevier Science Publishers.
Zacks, J. M., Braver, T. S., Sheridan, M. A., Donaldson, D. I., Snyder, A. Z., Ollinger, J. M., … Raichle, M. E. 2001. Human brain activity time-locked to perceptual event boundaries. Nature Neuroscience 4(6), 651–655.
Zacks, J. M., Kumar, S., Abrams, R. A. & Mehta, R. 2009. Using movement and intentions to understand human activity. Cognition 112, 201–216.
Zacks, J. M., Kurby, C. A., Eisenberg, M. L. & Haroutunian, N. 2011. Prediction error associated with the perceptual segmentation of naturalistic events. Journal of Cognitive Neuroscience 23, 4057–4066.
Zacks, J. M., Kurby, C. A., Landazabal, C. S., Krueger, F. & Grafman, J. 2016. Effects of penetrating traumatic brain injury on event segmentation and memory. Cortex 74, 233–246.
Zacks, J. M., Speer, N. K. & Reynolds, J. R. 2009. Segmentation in reading and film comprehension. Journal of Experimental Psychology: General 138(2), 307–327.
Zacks, J. M., Speer, N. K., Swallow, K. M., Braver, T. S. & Reynolds, J. R. 2007. Event perception: a mind-brain perspective. Psychological Bulletin 133(2), 273.
Zacks, J. M., Speer, N. K., Swallow, K. M. & Maley, C. J. 2010. The brain's cutting-room floor: Segmentation of narrative cinema. Frontiers in Human Neuroscience 4(168), 1–15.
Zacks, J. M., Speer, N. K., Vettel, J. M. & Jacoby, L. L. 2006. Event understanding and memory in healthy aging and dementia of the Alzheimer type. Psychology and Aging 21(3), 466–482.
Zacks, J. M. & Swallow, K. M. 2007. Event segmentation. Current Directions in Psychological Science 16(2), 80–84.
Zacks, J. M. & Tversky, B. 2001. Event structure in perception and conception. Psychological Bulletin 127(1), 3–21.

Zacks, J. M., Tversky, B. & Iyer, G. 2001. Perceiving, remembering, and communicating structure in events. Journal of Experimental Psychology: General 130(1), 29–58.
Zadbood, A., Chen, J., Leong, Y. C., Norman, K. A. & Hasson, U. 2017. How we transmit memories to other brains: constructing shared neural representations via communication. BioRxiv, 081208.
Zheng, Y., Markson, L. & Zacks, J. M. In preparation. The development of event perception and memory.
Zwaan, R. A. 1996. Processing narrative time shifts. Journal of Experimental Psychology: Learning, Memory, & Cognition 22(5), 1196–1207.
Zwaan, R. A., Magliano, J. P. & Graesser, A. C. 1995. Dimensions of situation model construction in narrative comprehension. Journal of Experimental Psychology: Learning, Memory, and Cognition 21(2), 386–397.
Zwaan, R. A., Radvansky, G. A., Hilliard, A. E. & Curiel, J. M. 1998. Constructing multidimensional situation models during reading. Scientific Studies of Reading 2(3), 199–220.

About the Series Editor

Fuyin (Thomas) Li (born 1963, Ph.D. 2002) received his Ph.D. in English Linguistics and Applied Linguistics from the Chinese University of Hong Kong. He is a professor of linguistics at Beihang University, where he has organized the China International Forum on Cognitive Linguistics (cifcl.buaa.edu.cn) since 2004. As the founding editor of the journal Cognitive Semantics (brill.com/cose), the founding editor of the International Journal of Cognitive Linguistics, editor of the series Distinguished Lectures in Cognitive Linguistics (brill.com/dlcl; originally Eminent Linguists’ Lecture Series), editor of the Compendium of Cognitive Linguistics Research, and organizer of ICLC-11, he plays a significant role in the international expansion of Cognitive Linguistics. His main research interests include Talmyan cognitive semantics, the overlapping systems model, event grammar, and causality, with a focus on synchronic and diachronic perspectives on Chinese data and a strong commitment to usage-based models and corpus methods. His representative publications include Metaphor, Image, and Image Schemas in Second Language Pedagogy (2009), Semantics: A Course Book (1999), An Introduction to Cognitive Linguistics (in Chinese, 2008), Semantics: An Introduction (in Chinese, 2007), and the Chinese versions of Toward a Cognitive Semantics, Volume Ⅰ: Concept Structuring Systems (2017) and Toward a Cognitive Semantics, Volume Ⅱ: Typology and Process in Concept Structuring (2019), both originally published in English by Leonard Talmy (MIT Press, 2000).
Personal homepage: http://shi.buaa.edu.cn/thomasli
E-mail: [email protected]; [email protected]

Websites for Cognitive Linguistics and CIFCL Speakers

All the websites were checked for validity on 20 January 2019.



Part 1 Websites for Cognitive Linguistics

1. http://www.cogling.org/
Website for the International Cognitive Linguistics Association, ICLA
2. http://www.cognitivelinguistics.org/en/journal
Website for the journal edited by ICLA, Cognitive Linguistics
3. http://cifcl.buaa.edu.cn/
Website for the China International Forum on Cognitive Linguistics (CIFCL)
4. http://cosebrill.edmgr.com/
Website for the journal Cognitive Semantics (ISSN 2352–6408 / E-ISSN 2352–6416), edited by CIFCL
5. http://www.degruyter.com/view/serial/16078?rskey=fw6Q2O&result=1&q=CLR
Website for the series Cognitive Linguistics Research [CLR]
6. http://www.degruyter.com/view/serial/20568?rskey=dddL3r&result=1&q=ACL
Website for the series Application of Cognitive Linguistics [ACL]
7. http://www.benjamins.com/#catalog/books/clscc/main
Website for the book series in Cognitive Linguistics by Benjamins
8. http://www.brill.com/cn/products/series/distinguished-lectures-cognitive-linguistics
Website for the series Distinguished Lectures in Cognitive Linguistics (DLCL)
9. http://refworks.reference-global.com/
Website for online resources for the Cognitive Linguistics Bibliography
10. http://benjamins.com/online/met/
Website for the Bibliography of Metaphor and Metonymy
11. http://linguistics.berkeley.edu/research/cognitive/
Website for the Cognitive Linguistics Program at UC Berkeley
12. https://framenet.icsi.berkeley.edu/fndrupal/
Website for FrameNet
13. http://www.mpi.nl/
Website for the Max Planck Institute for Psycholinguistics



Part 2 Websites for CIFCL Speakers and Their Research

14. CIFCL Organizer
Thomas Li, [email protected]; [email protected]
Personal homepage: http://shi.buaa.edu.cn/thomasli
http://shi.buaa.edu.cn/lifuyin/en/index.htm
15. CIFCL 18, 2018
Arie Verhagen, [email protected]
http://www.arieverhagen.nl/
16. CIFCL 17, 2017
Jeffrey M. Zacks, [email protected]
Lab: dcl.wustl.edu
Personal site: https://dcl.wustl.edu/affiliates/jeff-zacks/
17. CIFCL 16, 2016
Cliff Goddard, [email protected]
https://www.griffith.edu.au/griffith-centre-social-cultural-research/our-centre/cliff-goddard
18. CIFCL 15, 2016
Nikolas Gisborne, [email protected]
19. CIFCL 14, 2014
Phillip Wolff, [email protected]
20. CIFCL 13, 2013 (CIFCL 03, 2006)
Ronald W. Langacker, [email protected]
http://idiom.ucsd.edu/~rwl/
21. CIFCL 12, 2013 (CIFCL 18, 2018)
Stefan Th. Gries, [email protected]
http://www.stgries.info
22. CIFCL 12, 2013
Alan Cienki, [email protected]
https://research.vu.nl/en/persons/alan-cienki
23. CIFCL 11, 2012
Sherman Wilcox, [email protected]
http://www.unm.edu/~wilcox
24. CIFCL 10, 2012
Jürgen Bohnemeyer, [email protected]
Personal homepage: http://www.acsu.buffalo.edu/~jb77/
The CAL blog: https://causalityacrosslanguages.wordpress.com/
The blog of the UB Semantic Typology Lab: https://ubstlab.wordpress.com/
25. CIFCL 09, 2011
Laura A. Janda, [email protected]
http://ansatte.uit.no/laura.janda/
26. CIFCL 09, 2011
Ewa Dąbrowska, [email protected]
27. CIFCL 08, 2010
William Croft, [email protected]
http://www.unm.edu/~wcroft
28. CIFCL 08, 2010
Zoltán Kövecses, [email protected]
29. CIFCL 08, 2010
(Melissa Bowerman: 1942–2014)
30. CIFCL 07, 2009
Dirk Geeraerts, [email protected]
http://wwwling.arts.kuleuven.be/qlvl/dirkg.htm
31. CIFCL 07, 2009
Mark Turner, [email protected]
32. CIFCL 06, 2008
Chris Sinha, [email protected]
33. CIFCL 05, 2008
Gilles Fauconnier, [email protected]
34. CIFCL 04, 2007
Leonard Talmy, [email protected]
https://www.acsu.buffalo.edu/~talmy/talmy.html
35. CIFCL 03, 2006 (CIFCL 13, 2013)
Ronald W. Langacker, [email protected]
http://idiom.ucsd.edu/~rwl/
36. CIFCL 02, 2005
John Taylor, [email protected]
https://independent.academia.edu/JohnRTaylor
37. CIFCL 01, 2004
George Lakoff, [email protected]
http://georgelakoff.com/